arkouda.dataframe
Classes
- DataFrame: A DataFrame structure based on arkouda arrays.
- DataFrameGroupBy: A DataFrame that has been grouped by a subset of columns.
- DiffAggregate: A column in a GroupBy that has been differenced.
Functions
- intersect: Find the intersection of two arkouda arrays.
- intx: Find all the rows that are in both dataframes.
- invert_permutation: Find the inverse of a permutation array.
- merge: Merge Arkouda DataFrames with a database-style join.
Module Contents
- class arkouda.dataframe.DataFrame(dict=None, /, **kwargs)
Bases: collections.UserDict
A DataFrame structure based on arkouda arrays.
- Parameters:
initialdata (List or dictionary of lists, tuples, or pdarrays) – Each list/dictionary entry corresponds to one column of the data and should be a homogeneous type. Different columns may have different types. If using a dictionary, keys should be strings.
index (Index, pdarray, or Strings) – Index for the resulting frame. Defaults to an integer range.
columns (List, tuple, pdarray, or Strings) – Column labels to use if the data does not include them. Elements must be strings. Defaults to a stringified integer range.
Examples
Create an empty DataFrame and add a column of data:
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame()
>>> df['a'] = ak.array([1,2,3])
>>> display(df)
+----+-----+
|    |   a |
+====+=====+
|  0 |   1 |
+----+-----+
|  1 |   2 |
+----+-----+
|  2 |   3 |
+----+-----+
Create a new DataFrame using a dictionary of data:
>>> userName = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> userID = ak.array([111, 222, 111, 333, 222, 111])
>>> item = ak.array([0, 0, 1, 1, 2, 0])
>>> day = ak.array([5, 5, 6, 5, 6, 6])
>>> amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
>>> df = ak.DataFrame({'userName': userName, 'userID': userID,
...                    'item': item, 'day': day, 'amount': amount})
>>> display(df)
+----+------------+----------+--------+-------+----------+
|    | userName   |   userID |   item |   day |   amount |
+====+============+==========+========+=======+==========+
|  0 | Alice      |      111 |      0 |     5 |      0.5 |
+----+------------+----------+--------+-------+----------+
|  1 | Bob        |      222 |      0 |     5 |      0.6 |
+----+------------+----------+--------+-------+----------+
|  2 | Alice      |      111 |      1 |     6 |      1.1 |
+----+------------+----------+--------+-------+----------+
|  3 | Carol      |      333 |      1 |     5 |      1.2 |
+----+------------+----------+--------+-------+----------+
|  4 | Bob        |      222 |      2 |     6 |      4.3 |
+----+------------+----------+--------+-------+----------+
|  5 | Alice      |      111 |      0 |     6 |      0.6 |
+----+------------+----------+--------+-------+----------+
Indexing works slightly differently than with pandas:
>>> df[0]
+----------+----------+
| keys     | values   |
+==========+==========+
| userName | Alice    |
+----------+----------+
| userID   | 111      |
+----------+----------+
| item     | 0        |
+----------+----------+
| day      | 5        |
+----------+----------+
| amount   | 0.5      |
+----------+----------+
>>> df['userID']
array([111, 222, 111, 333, 222, 111])
>>> df['userName']
array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> df[ak.array([1,3,5])]
+----+------------+----------+--------+-------+----------+
|    | userName   |   userID |   item |   day |   amount |
+====+============+==========+========+=======+==========+
|  0 | Bob        |      222 |      0 |     5 |      0.6 |
+----+------------+----------+--------+-------+----------+
|  1 | Carol      |      333 |      1 |     5 |      1.2 |
+----+------------+----------+--------+-------+----------+
|  2 | Alice      |      111 |      0 |     6 |      0.6 |
+----+------------+----------+--------+-------+----------+
Slicing with a stride is also supported:
>>> df[1:5:1]
+----+------------+----------+--------+-------+----------+
|    | userName   |   userID |   item |   day |   amount |
+====+============+==========+========+=======+==========+
|  0 | Bob        |      222 |      0 |     5 |      0.6 |
+----+------------+----------+--------+-------+----------+
|  1 | Alice      |      111 |      1 |     6 |      1.1 |
+----+------------+----------+--------+-------+----------+
|  2 | Carol      |      333 |      1 |     5 |      1.2 |
+----+------------+----------+--------+-------+----------+
|  3 | Bob        |      222 |      2 |     6 |      4.3 |
+----+------------+----------+--------+-------+----------+
>>> df[ak.array([1,2,3])]
+----+------------+----------+--------+-------+----------+
|    | userName   |   userID |   item |   day |   amount |
+====+============+==========+========+=======+==========+
|  0 | Bob        |      222 |      0 |     5 |      0.6 |
+----+------------+----------+--------+-------+----------+
|  1 | Alice      |      111 |      1 |     6 |      1.1 |
+----+------------+----------+--------+-------+----------+
|  2 | Carol      |      333 |      1 |     5 |      1.2 |
+----+------------+----------+--------+-------+----------+
>>> df[['userID', 'day']]
+----+----------+-------+
|    |   userID |   day |
+====+==========+=======+
|  0 |      111 |     5 |
+----+----------+-------+
|  1 |      222 |     5 |
+----+----------+-------+
|  2 |      111 |     6 |
+----+----------+-------+
|  3 |      333 |     5 |
+----+----------+-------+
|  4 |      222 |     6 |
+----+----------+-------+
|  5 |      111 |     6 |
+----+----------+-------+
- GroupBy(keys, use_series=False, as_index=True, dropna=True)
Group the dataframe by a column or a list of columns.
- Parameters:
keys (str or list of str) – An (ordered) list of column names or a single string to group by.
use_series (bool, default=False) – If True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise an arkouda.groupbyclass.GroupBy object.
as_index (bool, default=True) – If True, groupby columns will be set as index otherwise, the groupby columns will be treated as DataFrame columns.
dropna (bool, default=True) – If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.
- Returns:
If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise returns an arkouda.groupbyclass.GroupBy object.
- Return type:
arkouda.dataframe.DataFrameGroupBy or arkouda.groupbyclass.GroupBy
See also
groupby
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]})
>>> df
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      4 |
+----+--------+--------+
|  1 |      1 |      5 |
+----+--------+--------+
|  2 |      2 |      6 |
+----+--------+--------+
|  3 |    nan |      7 |
+----+--------+--------+
>>> df.GroupBy("col1")
<arkouda.groupbyclass.GroupBy at 0x7f2cf23e10c0>
>>> df.GroupBy("col1").size()
(array([1.00000000000000000 2.00000000000000000]), array([2 1]))
>>> df.GroupBy("col1", use_series=True).size()
col1
1.0    2
2.0    1
dtype: int64
>>> df.GroupBy("col1", use_series=True, as_index=False).size()
+----+--------+--------+
|    |   col1 |   size |
+====+========+========+
|  0 |      1 |      2 |
+----+--------+--------+
|  1 |      2 |      1 |
+----+--------+--------+
- all(axis=0) → Series | bool
Return whether all elements are True, potentially over an axis.
Returns True unless there is at least one element along a DataFrame axis that is False.
Currently, will ignore any columns that are not type bool. This is equivalent to the pandas option bool_only=True.
- Parameters:
axis ({0 or ‘index’, 1 or ‘columns’, None}, default = 0) –
Indicate which axis or axes should be reduced.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
- Return type:
arkouda.series.Series or bool
- Raises:
ValueError – Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}.
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A": [True, True, True, False], "B": [True, True, True, False],
...                    "C": [True, False, True, False], "D": [True, True, True, True]})
>>> display(df)
+----+-------+-------+-------+-------+
|    | A     | B     | C     | D     |
+====+=======+=======+=======+=======+
|  0 | True  | True  | True  | True  |
+----+-------+-------+-------+-------+
|  1 | True  | True  | False | True  |
+----+-------+-------+-------+-------+
|  2 | True  | True  | True  | True  |
+----+-------+-------+-------+-------+
|  3 | False | False | False | True  |
+----+-------+-------+-------+-------+
>>> df.all(axis=0)
A    False
B    False
C    False
D     True
dtype: bool
>>> df.all(axis=1)
0     True
1    False
2     True
3    False
dtype: bool
>>> df.all(axis=None)
False
- any(axis=0) → Series | bool
Return whether any element is True, potentially over an axis.
Returns False unless there is at least one element along a Dataframe axis that is True.
Currently, will ignore any columns that are not type bool. This is equivalent to the pandas option bool_only=True.
- Parameters:
axis ({0 or ‘index’, 1 or ‘columns’, None}, default = 0) –
Indicate which axis or axes should be reduced.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
- Return type:
arkouda.series.Series or bool
- Raises:
ValueError – Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}.
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A": [True, True, True, False], "B": [True, True, True, False],
...                    "C": [True, False, True, False], "D": [False, False, False, False]})
>>> display(df)
+----+-------+-------+-------+-------+
|    | A     | B     | C     | D     |
+====+=======+=======+=======+=======+
|  0 | True  | True  | True  | False |
+----+-------+-------+-------+-------+
|  1 | True  | True  | False | False |
+----+-------+-------+-------+-------+
|  2 | True  | True  | True  | False |
+----+-------+-------+-------+-------+
|  3 | False | False | False | False |
+----+-------+-------+-------+-------+
>>> df.any(axis=0)
A     True
B     True
C     True
D    False
dtype: bool
>>> df.any(axis=1)
0     True
1     True
2     True
3    False
dtype: bool
>>> df.any(axis=None)
True
- append(other, ordered=True)
Concatenate data from ‘other’ onto the end of this DataFrame, in place.
Explicitly, use the arkouda concatenate function to append the data from each column in other to the end of self. This operation is done in place, in the sense that the underlying pdarrays are updated from the result of the arkouda concatenate function, rather than returning a new DataFrame object containing the result.
- Parameters:
other (DataFrame) – The DataFrame object whose data will be appended to this DataFrame.
ordered (bool, default=True) – If False, allow rows to be interleaved for better performance (but data within a row remains together). By default, append all rows to the end, in input order.
- Returns:
Appending occurs in-place, but result is returned for compatibility.
- Return type:
self
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df1 = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> display(df1)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      3 |
+----+--------+--------+
|  1 |      2 |      4 |
+----+--------+--------+
>>> df2 = ak.DataFrame({'col1': [3], 'col2': [5]})
>>> display(df2)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      3 |      5 |
+----+--------+--------+
>>> df1.append(df2)
>>> df1
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      3 |
+----+--------+--------+
|  1 |      2 |      4 |
+----+--------+--------+
|  2 |      3 |      5 |
+----+--------+--------+
- apply_permutation(perm)
Apply a permutation to an entire DataFrame. The operation is done in place and the original DataFrame will be modified.
This may be useful if you want to unsort a DataFrame, or even to apply an arbitrary permutation such as the inverse of a sorting permutation.
- Parameters:
perm (pdarray) – A permutation array. Should be the same size as the data arrays, and should consist of the integers [0,size-1] in some order. Very minimal testing is done to ensure this is a permutation.
- Return type:
None
See also
sort
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> display(df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      4 |
+----+--------+--------+
|  1 |      2 |      5 |
+----+--------+--------+
|  2 |      3 |      6 |
+----+--------+--------+
>>> perm_arry = ak.array([0, 2, 1])
>>> df.apply_permutation(perm_arry)
>>> display(df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      4 |
+----+--------+--------+
|  1 |      3 |      6 |
+----+--------+--------+
|  2 |      2 |      5 |
+----+--------+--------+
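A related use mentioned above is undoing a sort. The following is a minimal sketch, assuming the module-level invert_permutation function (listed under Functions above) returns the inverse of a permutation array:
>>> from arkouda.dataframe import invert_permutation
>>> perm = df.argsort('col1')
>>> df.apply_permutation(perm)                      # rows now sorted by col1
>>> df.apply_permutation(invert_permutation(perm))  # original row order restored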
- argsort(key, ascending=True)
Return the permutation that sorts the dataframe by key.
- Parameters:
key (str) – The key to sort on.
ascending (bool, default = True) – If true, sort the key in ascending order. Otherwise, sort the key in descending order.
- Returns:
The permutation array that sorts the data on key.
- Return type:
arkouda.pdarrayclass.pdarray
See also
coargsort
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]})
>>> display(df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |    1.1 |      6 |
+----+--------+--------+
|  1 |    3.1 |      5 |
+----+--------+--------+
|  2 |    2.1 |      4 |
+----+--------+--------+
>>> df.argsort('col1')
array([0 2 1])
>>> sorted_df1 = df[df.argsort('col1')]
>>> display(sorted_df1)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |    1.1 |      6 |
+----+--------+--------+
|  1 |    2.1 |      4 |
+----+--------+--------+
|  2 |    3.1 |      5 |
+----+--------+--------+
>>> df.argsort('col2')
array([2 1 0])
>>> sorted_df2 = df[df.argsort('col2')]
>>> display(sorted_df2)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |    2.1 |      4 |
+----+--------+--------+
|  1 |    3.1 |      5 |
+----+--------+--------+
|  2 |    1.1 |      6 |
+----+--------+--------+
- assign(**kwargs) → DataFrame
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
- Parameters:
**kwargs (dict of {str: callable or Series}) – The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.
- Returns:
A new DataFrame with the new columns in addition to all the existing columns.
- Return type:
arkouda.dataframe.DataFrame
Notes
Assigning multiple columns within the same assign is possible. Later items in '**kwargs' may refer to newly created or modified columns in 'df'; items are computed and assigned into 'df' in order.
Examples
>>> df = ak.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df
          temp_c
Portland    17.0
Berkeley    25.0
Where the value is a callable, evaluated on df:
>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence:
>>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
You can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign:
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
...           temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15
- attach(user_defined_name: str) → DataFrame
Function to return a DataFrame object attached to the registered name in the arkouda server which was registered using register().
- Parameters:
user_defined_name (str) – user defined name which DataFrame object was registered under.
- Returns:
The DataFrame object created by re-attaching to the corresponding server components.
- Return type:
arkouda.dataframe.DataFrame
- Raises:
RegistrationError – if user_defined_name is not registered
See also
register, unregister, is_registered
Example
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
- coargsort(keys, ascending=True)
Return the permutation that sorts the dataframe by keys.
Note: Sorting using Strings may not yield correct sort order.
- Parameters:
keys (list of str) – The keys to sort on.
- Returns:
The permutation array that sorts the data on keys.
- Return type:
arkouda.pdarrayclass.pdarray
Example
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3': [5, 6, 7]})
>>> display(df)
+----+--------+--------+--------+
|    |   col1 |   col2 |   col3 |
+====+========+========+========+
|  0 |      2 |      3 |      5 |
+----+--------+--------+--------+
|  1 |      2 |      4 |      6 |
+----+--------+--------+--------+
|  2 |      1 |      3 |      7 |
+----+--------+--------+--------+
>>> df.coargsort(['col1', 'col2'])
array([2 0 1])
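As with argsort, the returned permutation can be used to index the DataFrame. A small sketch continuing the example above (the expected rows follow from the table shown: (1, 3, 7), (2, 3, 5), (2, 4, 6)):
>>> sorted_df = df[df.coargsort(['col1', 'col2'])]
>>> display(sorted_df)
+----+--------+--------+--------+
|    |   col1 |   col2 |   col3 |
+====+========+========+========+
|  0 |      1 |      3 |      7 |
+----+--------+--------+--------+
|  1 |      2 |      3 |      5 |
+----+--------+--------+--------+
|  2 |      2 |      4 |      6 |
+----+--------+--------+--------+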
- property columns
An Index where the values are the column names of the dataframe.
- Returns:
The values of the index are the column names of the dataframe.
- Return type:
arkouda.index.Index
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      3 |
+----+--------+--------+
|  1 |      2 |      4 |
+----+--------+--------+
>>> df.columns
Index(array(['col1', 'col2']), dtype='<U0')
- corr() → DataFrame
Return new DataFrame with pairwise correlation of columns.
- Returns:
Arkouda DataFrame containing correlation matrix of all columns.
- Return type:
arkouda.dataframe.DataFrame
- Raises:
RuntimeError – Raised if there’s a server-side error thrown.
See also
pdarray.corr
Notes
Generates the correlation matrix using Pearson R for all columns.
Attempts to convert to numeric values where possible for inclusion in the matrix.
Example
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [-1, -2]})
>>> display(df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |     -1 |
+----+--------+--------+
|  1 |      2 |     -2 |
+----+--------+--------+
>>> corr = df.corr()
>>> display(corr)
+------+--------+--------+
|      |   col1 |   col2 |
+======+========+========+
| col1 |      1 |     -1 |
+------+--------+--------+
| col2 |     -1 |      1 |
+------+--------+--------+
- count(axis: int | str = 0, numeric_only=False) → Series
Count non-NA cells for each column or row.
The values np.NaN are considered NA.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.
numeric_only (bool = False) – Include only float, int or boolean data.
- Returns:
For each column/row the number of non-NA/null entries.
- Return type:
arkouda.series.Series
- Raises:
ValueError – Raised if axis is not 0, 1, ‘index’, or ‘columns’.
See also
GroupBy.count
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({'col_A': ak.array([7, np.nan]), 'col_B': ak.array([1, 9])})
>>> display(df)
+----+---------+---------+
|    |   col_A |   col_B |
+====+=========+=========+
|  0 |       7 |       1 |
+----+---------+---------+
|  1 |     nan |       9 |
+----+---------+---------+
>>> df.count()
col_A    1
col_B    2
dtype: int64
>>> df = ak.DataFrame({'col_A': ak.array(["a", "b", "c"]), 'col_B': ak.array([1, np.nan, np.nan])})
>>> display(df)
+----+---------+---------+
|    | col_A   |   col_B |
+====+=========+=========+
|  0 | a       |       1 |
+----+---------+---------+
|  1 | b       |     nan |
+----+---------+---------+
|  2 | c       |     nan |
+----+---------+---------+
>>> df.count()
col_A    3
col_B    1
dtype: int64
>>> df.count(numeric_only=True)
col_B    1
dtype: int64
>>> df.count(axis=1)
0    2
1    1
2    1
dtype: int64
- drop(keys: str | int | List[str | int], axis: str | int = 0, inplace: bool = False) → None | DataFrame
Drop column/s or row/s from the dataframe.
- Parameters:
keys (str, int or list) – The labels to be dropped on the given axis.
axis (int or str) – The axis on which to drop from. 0/’index’ - drop rows, 1/’columns’ - drop columns.
inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False; None when inplace=True.
- Return type:
arkouda.dataframe.DataFrame or None
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> display(df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      3 |
+----+--------+--------+
|  1 |      2 |      4 |
+----+--------+--------+
Drop a column:
>>> df.drop('col1', axis=1)
+----+--------+
|    |   col2 |
+====+========+
|  0 |      3 |
+----+--------+
|  1 |      4 |
+----+--------+
Drop a row:
>>> df.drop(0, axis=0)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      2 |      4 |
+----+--------+--------+
- drop_duplicates(subset=None, keep='first')
Drops duplicated rows and returns the resulting DataFrame.
If a subset of the columns is provided, then only one instance of each duplicated row will be returned (keep determines which row).
- Parameters:
subset (Iterable) – Iterable of column names to use to dedupe.
keep ({'first', 'last'}, default='first') – Determines which duplicates (if any) to keep.
- Returns:
DataFrame with duplicates removed.
- Return type:
arkouda.dataframe.DataFrame
Example
>>> df = ak.DataFrame({'col1': [1, 2, 2, 3], 'col2': [4, 5, 5, 6]})
>>> display(df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      4 |
+----+--------+--------+
|  1 |      2 |      5 |
+----+--------+--------+
|  2 |      2 |      5 |
+----+--------+--------+
|  3 |      3 |      6 |
+----+--------+--------+
>>> df.drop_duplicates()
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      4 |
+----+--------+--------+
|  1 |      2 |      5 |
+----+--------+--------+
|  2 |      3 |      6 |
+----+--------+--------+
- dropna(axis: int | str = 0, how: str | None = None, thresh: int | None = None, ignore_index: bool = False) → DataFrame
Remove missing values.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default = 0) –
Determine if rows or columns which contain missing values are removed.
0, or ‘index’: Drop rows which contain missing values.
1, or ‘columns’: Drop columns which contain missing value.
Only a single axis is allowed.
how ({'any', 'all'}, default='any') –
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
’any’: If any NA values are present, drop that row or column.
’all’: If all values are NA, drop that row or column.
thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.
ignore_index (bool, default=False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
- Returns:
DataFrame with NA entries dropped from it.
- Return type:
arkouda.dataframe.DataFrame
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame(
...     {
...         "A": [True, True, True, True],
...         "B": [1, np.nan, 2, np.nan],
...         "C": [1, 2, 3, np.nan],
...         "D": [False, False, False, False],
...         "E": [1, 2, 3, 4],
...         "F": ["a", "b", "c", "d"],
...         "G": [1, 2, 3, 4],
...     }
... )
>>> display(df)
+----+------+-----+-----+-------+-----+-----+-----+
|    | A    |   B |   C | D     |   E | F   |   G |
+====+======+=====+=====+=======+=====+=====+=====+
|  0 | True |   1 |   1 | False |   1 | a   |   1 |
+----+------+-----+-----+-------+-----+-----+-----+
|  1 | True | nan |   2 | False |   2 | b   |   2 |
+----+------+-----+-----+-------+-----+-----+-----+
|  2 | True |   2 |   3 | False |   3 | c   |   3 |
+----+------+-----+-----+-------+-----+-----+-----+
|  3 | True | nan | nan | False |   4 | d   |   4 |
+----+------+-----+-----+-------+-----+-----+-----+
>>> df.dropna()
+----+------+-----+-----+-------+-----+-----+-----+
|    | A    |   B |   C | D     |   E | F   |   G |
+====+======+=====+=====+=======+=====+=====+=====+
|  0 | True |   1 |   1 | False |   1 | a   |   1 |
+----+------+-----+-----+-------+-----+-----+-----+
|  1 | True |   2 |   3 | False |   3 | c   |   3 |
+----+------+-----+-----+-------+-----+-----+-----+
>>> df.dropna(axis=1)
+----+------+-------+-----+-----+-----+
|    | A    | D     |   E | F   |   G |
+====+======+=======+=====+=====+=====+
|  0 | True | False |   1 | a   |   1 |
+----+------+-------+-----+-----+-----+
|  1 | True | False |   2 | b   |   2 |
+----+------+-------+-----+-----+-----+
|  2 | True | False |   3 | c   |   3 |
+----+------+-------+-----+-----+-----+
|  3 | True | False |   4 | d   |   4 |
+----+------+-------+-----+-----+-----+
>>> df.dropna(axis=1, thresh=3)
+----+------+-----+-------+-----+-----+-----+
|    | A    |   C | D     |   E | F   |   G |
+====+======+=====+=======+=====+=====+=====+
|  0 | True |   1 | False |   1 | a   |   1 |
+----+------+-----+-------+-----+-----+-----+
|  1 | True |   2 | False |   2 | b   |   2 |
+----+------+-----+-------+-----+-----+-----+
|  2 | True |   3 | False |   3 | c   |   3 |
+----+------+-----+-------+-----+-----+-----+
|  3 | True | nan | False |   4 | d   |   4 |
+----+------+-----+-------+-----+-----+-----+
>>> df.dropna(axis=1, how="all")
+----+------+-----+-----+-------+-----+-----+-----+
|    | A    |   B |   C | D     |   E | F   |   G |
+====+======+=====+=====+=======+=====+=====+=====+
|  0 | True |   1 |   1 | False |   1 | a   |   1 |
+----+------+-----+-----+-------+-----+-----+-----+
|  1 | True | nan |   2 | False |   2 | b   |   2 |
+----+------+-----+-----+-------+-----+-----+-----+
|  2 | True |   2 |   3 | False |   3 | c   |   3 |
+----+------+-----+-----+-------+-----+-----+-----+
|  3 | True | nan | nan | False |   4 | d   |   4 |
+----+------+-----+-----+-------+-----+-----+-----+
- property dtypes: DataFrame
The dtypes of the dataframe.
- Returns:
dtypes – The dtypes of the dataframe.
- Return type:
arkouda.dataframe.DataFrame
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]})
>>> df
+----+--------+--------+
|    |   col1 | col2   |
+====+========+========+
|  0 |      1 | a      |
+----+--------+--------+
|  1 |      2 | b      |
+----+--------+--------+
>>> df.dtypes
+--------+----------+
| keys   | values   |
+========+==========+
| col1   | int64    |
+--------+----------+
| col2   | str      |
+--------+----------+
- property empty: DataFrame
Whether the dataframe is empty.
- Returns:
True if the dataframe is empty, otherwise False.
- Return type:
bool
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({})
>>> df
0 rows x 0 columns
>>> df.empty
True
- filter_by_range(keys, low=1, high=None)
Find all rows where the value count of the items in a given set of columns (keys) is within the range [low, high].
To filter by a specific value, set low == high.
- Parameters:
keys (str or list of str) – The names of the columns to group by.
low (int, default=1) – The lowest value count.
high (int, default=None) – The highest value count, default to unlimited.
- Returns:
An array of boolean values for qualified rows in this DataFrame.
- Return type:
arkouda.pdarrayclass.pdarray
Example
>>> df = ak.DataFrame({'col1': [1, 2, 2, 2, 3, 3], 'col2': [4, 5, 6, 7, 8, 9]})
>>> display(df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      4 |
+----+--------+--------+
|  1 |      2 |      5 |
+----+--------+--------+
|  2 |      2 |      6 |
+----+--------+--------+
|  3 |      2 |      7 |
+----+--------+--------+
|  4 |      3 |      8 |
+----+--------+--------+
|  5 |      3 |      9 |
+----+--------+--------+
>>> df.filter_by_range("col1", low=1, high=2)
array([True False False False True True])
>>> filtered_df = df[df.filter_by_range("col1", low=1, high=2)]
>>> display(filtered_df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      4 |
+----+--------+--------+
|  1 |      3 |      8 |
+----+--------+--------+
|  2 |      3 |      9 |
+----+--------+--------+
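As noted above, setting low == high keeps only rows whose key occurs exactly that many times. A small sketch continuing the example (in col1, only the value 2 occurs exactly three times, so only those rows qualify):
>>> df.filter_by_range("col1", low=3, high=3)
array([False True True True False False])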
- from_pandas(pd_df)
Copy the data from a pandas DataFrame into a new arkouda.dataframe.DataFrame.
- Parameters:
pd_df (pandas.DataFrame) – A pandas DataFrame to convert.
- Return type:
arkouda.dataframe.DataFrame
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import pandas as pd
>>> pd_df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> type(pd_df)
pandas.core.frame.DataFrame
>>> display(pd_df)
   A  B
0  1  3
1  2  4
>>> ak_df = DataFrame.from_pandas(pd_df)
>>> type(ak_df)
arkouda.dataframe.DataFrame
>>> display(ak_df)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   1 |   3 |
+----+-----+-----+
|  1 |   2 |   4 |
+----+-----+-----+
- from_return_msg(rep_msg)
Creates a DataFrame object from an arkouda server response message.
- Parameters:
rep_msg (string) – Server response message used to create a DataFrame.
- Return type:
arkouda.dataframe.DataFrame
- groupby(keys, use_series=True, as_index=True, dropna=True)
Group the dataframe by a column or a list of columns. Alias for GroupBy.
- Parameters:
keys (str or list of str) – An (ordered) list of column names or a single string to group by.
use_series (bool, default=True) – If True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise an arkouda.groupbyclass.GroupBy object.
as_index (bool, default=True) – If True, groupby columns will be set as index otherwise, the groupby columns will be treated as DataFrame columns.
dropna (bool, default=True) – If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.
- Returns:
If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise returns an arkouda.groupbyclass.GroupBy object.
- Return type:
arkouda.dataframe.DataFrameGroupBy or arkouda.groupbyclass.GroupBy
See also
GroupBy
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]})
>>> df
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      4 |
+----+--------+--------+
|  1 |      1 |      5 |
+----+--------+--------+
|  2 |      2 |      6 |
+----+--------+--------+
|  3 |    nan |      7 |
+----+--------+--------+
>>> df.GroupBy("col1")
<arkouda.groupbyclass.GroupBy at 0x7f2cf23e10c0>
>>> df.GroupBy("col1").size()
(array([1.00000000000000000 2.00000000000000000]), array([2 1]))
>>> df.GroupBy("col1", use_series=True).size()
col1
1.0    2
2.0    1
dtype: int64
>>> df.GroupBy("col1", use_series=True, as_index=False).size()
+----+--------+--------+
|    |   col1 |   size |
+====+========+========+
|  0 |      1 |      2 |
+----+--------+--------+
|  1 |      2 |      1 |
+----+--------+--------+
- head(n=5)
Return the first n rows.
This function returns the first n rows of the dataframe. It is useful for quickly verifying data, for example, after sorting or appending rows.
- Parameters:
n (int, default = 5) – Number of rows to select.
- Returns:
The first n rows of the DataFrame.
- Return type:
arkouda.dataframe.DataFrame
See also
tail
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> display(df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      0 |      0 |
+----+--------+--------+
|  1 |      1 |     -1 |
+----+--------+--------+
|  2 |      2 |     -2 |
+----+--------+--------+
|  3 |      3 |     -3 |
+----+--------+--------+
|  4 |      4 |     -4 |
+----+--------+--------+
|  5 |      5 |     -5 |
+----+--------+--------+
|  6 |      6 |     -6 |
+----+--------+--------+
|  7 |      7 |     -7 |
+----+--------+--------+
|  8 |      8 |     -8 |
+----+--------+--------+
|  9 |      9 |     -9 |
+----+--------+--------+
>>> df.head()
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      0 |      0 |
+----+--------+--------+
|  1 |      1 |     -1 |
+----+--------+--------+
|  2 |      2 |     -2 |
+----+--------+--------+
|  3 |      3 |     -3 |
+----+--------+--------+
|  4 |      4 |     -4 |
+----+--------+--------+
>>> df.head(n=2)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      0 |      0 |
+----+--------+--------+
|  1 |      1 |     -1 |
+----+--------+--------+
- property index
The index of the dataframe.
- Returns:
The index of the dataframe.
- Return type:
arkouda.index.Index
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      3 |
+----+--------+--------+
|  1 |      2 |      4 |
+----+--------+--------+
>>> df.index
Index(array([0 1]), dtype='int64')
- property info
Returns a summary string of this dataframe.
- Returns:
A summary string of this dataframe.
- Return type:
str
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]})
>>> df
+----+--------+--------+
|    |   col1 | col2   |
+====+========+========+
|  0 |      1 | a      |
+----+--------+--------+
|  1 |      2 | b      |
+----+--------+--------+
>>> df.info
"DataFrame(['col1', 'col2'], 2 rows, 20 B)"
- is_registered() → bool
Return True if the object is contained in the registry.
- Returns:
Indicates if the object is contained in the registry.
- Return type:
bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components.
See also
register, unregister, attach
Notes
Objects registered with the server are immune to deletion until they are unregistered.
Example
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
- isin(values: pdarray | Dict | Series | DataFrame) → DataFrame
Determine whether each element in the DataFrame is contained in values.
- Parameters:
values (pdarray, dict, Series, or DataFrame) – The values to check for in DataFrame. Series can only have a single index.
- Returns:
Arkouda DataFrame of booleans showing whether each element in the DataFrame is contained in values.
- Return type:
arkouda.dataframe.DataFrame
See also
ak.Series.isin
Notes
Pandas supports values being an iterable type. In arkouda, we replace this with pdarray.
Pandas supports ~ operations. Currently, ak.DataFrame does not support this.
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col_A': ak.array([7, 3]), 'col_B': ak.array([1, 9])})
>>> display(df)
+----+---------+---------+
|    |   col_A |   col_B |
+====+=========+=========+
|  0 |       7 |       1 |
+----+---------+---------+
|  1 |       3 |       9 |
+----+---------+---------+
When values is a pdarray, check every value in the DataFrame to determine if it exists in values.
>>> df.isin(ak.array([0, 1]))
+----+---------+---------+
|    |   col_A |   col_B |
+====+=========+=========+
|  0 |       0 |       1 |
+----+---------+---------+
|  1 |       0 |       0 |
+----+---------+---------+
When values is a dict, the values in the dict are passed to check the column indicated by the key.
>>> df.isin({'col_A': ak.array([0, 3])})
+----+---------+---------+
|    |   col_A |   col_B |
+====+=========+=========+
|  0 |       0 |       0 |
+----+---------+---------+
|  1 |       1 |       0 |
+----+---------+---------+
When values is a Series, each column is checked if values is present positionally. This means that for True to be returned, the indexes must be the same.
>>> i = ak.Index(ak.arange(2))
>>> s = ak.Series(data=[3, 9], index=i)
>>> df.isin(s)
+----+---------+---------+
|    |   col_A |   col_B |
+====+=========+=========+
|  0 |       0 |       0 |
+----+---------+---------+
|  1 |       0 |       1 |
+----+---------+---------+
When values is a DataFrame, the index and column must match. Note that 9 is not found because the column name does not match.
>>> other_df = ak.DataFrame({'col_A': ak.array([7, 3]), 'col_C': ak.array([0, 9])})
>>> df.isin(other_df)
+----+---------+---------+
|    |   col_A |   col_B |
+====+=========+=========+
|  0 |       1 |       0 |
+----+---------+---------+
|  1 |       1 |       0 |
+----+---------+---------+
- isna() → DataFrame
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. numpy.NaN values get mapped to True values. Everything else gets mapped to False values.
- Returns:
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
- Return type:
arkouda.dataframe.DataFrame
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({"A": [np.nan, 2, 2, 3], "B": [3, np.nan, 5, 6],
...                    "C": [1, np.nan, 2, np.nan], "D": ["a", "b", "c", "d"]})
>>> display(df)
+----+-----+-----+-----+-----+
|    |   A |   B |   C | D   |
+====+=====+=====+=====+=====+
|  0 | nan |   3 |   1 | a   |
+----+-----+-----+-----+-----+
|  1 |   2 | nan | nan | b   |
+----+-----+-----+-----+-----+
|  2 |   2 |   5 |   2 | c   |
+----+-----+-----+-----+-----+
|  3 |   3 |   6 | nan | d   |
+----+-----+-----+-----+-----+
>>> df.isna()
       A      B      C      D
0   True  False  False  False
1  False   True   True  False
2  False  False  False  False
3  False  False   True  False
(4 rows x 4 columns)
- load(prefix_path, file_format='INFER')
Load a dataframe from file. file_format is needed for consistency with other load functions.
- Parameters:
prefix_path (str) – The prefix path for the data.
file_format (string, default = "INFER")
- Returns:
A dataframe loaded from the prefix_path.
- Return type:
arkouda.dataframe.DataFrame
Examples
To store data in <my_dir>/my_data_LOCALE0000, use “<my_dir>/my_data” as the prefix.
>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf5_output', 'my_data')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> df.save(my_path, file_type="distribute")
>>> df.load(my_path)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   0 |   0 |
+----+-----+-----+
|  1 |   1 |  -1 |
+----+-----+-----+
|  2 |   2 |  -2 |
+----+-----+-----+
|  3 |   3 |  -3 |
+----+-----+-----+
|  4 |   4 |  -4 |
+----+-----+-----+
- memory_usage(index=True, unit='B') → Series
Return the memory usage of each column in bytes.
The memory usage can optionally include the contribution of the index.
- Parameters:
index (bool, default=True) – Specifies whether to include the memory usage of the DataFrame's index in the returned Series. If index=True, the memory usage of the index is the first item in the output.
unit (str, default="B") – Unit to return. One of {'B', 'KB', 'MB', 'GB'}.
- Returns:
A Series whose index is the original column names and whose values are the memory usage of each column in bytes.
- Return type:
arkouda.series.Series
See also
arkouda.pdarrayclass.nbytes, arkouda.index.Index.memory_usage, arkouda.index.MultiIndex.memory_usage, arkouda.series.Series.memory_usage
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> dtypes = [ak.int64, ak.float64, ak.bool]
>>> data = dict([(str(t), ak.ones(5000, dtype=ak.int64).astype(t)) for t in dtypes])
>>> df = ak.DataFrame(data)
>>> display(df.head())
+----+---------+-----------+--------+
|    |   int64 |   float64 | bool   |
+====+=========+===========+========+
|  0 |       1 |         1 | True   |
+----+---------+-----------+--------+
|  1 |       1 |         1 | True   |
+----+---------+-----------+--------+
|  2 |       1 |         1 | True   |
+----+---------+-----------+--------+
|  3 |       1 |         1 | True   |
+----+---------+-----------+--------+
|  4 |       1 |         1 | True   |
+----+---------+-----------+--------+
>>> df.memory_usage()
+---------+-------+
|         |     0 |
+=========+=======+
| Index   | 40000 |
+---------+-------+
| int64   | 40000 |
+---------+-------+
| float64 | 40000 |
+---------+-------+
| bool    |  5000 |
+---------+-------+
>>> df.memory_usage(index=False)
+---------+-------+
|         |     0 |
+=========+=======+
| int64   | 40000 |
+---------+-------+
| float64 | 40000 |
+---------+-------+
| bool    |  5000 |
+---------+-------+
>>> df.memory_usage(unit="KB")
+---------+----------+
|         |        0 |
+=========+==========+
| Index   | 39.0625  |
+---------+----------+
| int64   | 39.0625  |
+---------+----------+
| float64 | 39.0625  |
+---------+----------+
| bool    |  4.88281 |
+---------+----------+
To get the approximate total memory usage (here 40000 + 40000 + 40000 + 5000 bytes):
>>> df.memory_usage(index=True).sum()
125000
- memory_usage_info(unit='GB')
A formatted string representation of the size of this DataFrame.
- Parameters:
unit (str, default = "GB") – Unit to return. One of {‘KB’, ‘MB’, ‘GB’}.
- Returns:
A string representation of the number of bytes used by this DataFrame in [unit]s.
- Return type:
str
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(1000), 'col2': ak.arange(1000)})
>>> df.memory_usage_info()
'0.00 GB'
>>> df.memory_usage_info(unit="KB")
'15 KB'
- merge(right: DataFrame, on: str | List[str] | None = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) → DataFrame
Merge Arkouda DataFrames with a database-style join. The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters).
Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
- Parameters:
right (DataFrame) – The Right DataFrame to be joined.
on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames.
how ({"inner", "left", "right}, default = "inner") – The merge condition. Must be “inner”, “left”, or “right”.
left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”.
right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”.
convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = “inner”.
sort (bool = True) – If True, DataFrame is returned sorted by “on”. Otherwise, the DataFrame is not sorted.
- Returns:
Joined Arkouda DataFrame.
- Return type:
arkouda.dataframe.DataFrame
Note
Multiple column joins are only supported for integer columns.
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)})
>>> display(left_df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      0 |      0 |
+----+--------+--------+
|  1 |      1 |     -1 |
+----+--------+--------+
|  2 |      2 |     -2 |
+----+--------+--------+
|  3 |      3 |     -3 |
+----+--------+--------+
|  4 |      4 |     -4 |
+----+--------+--------+
>>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)})
>>> display(right_df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      0 |      0 |
+----+--------+--------+
|  1 |      2 |      2 |
+----+--------+--------+
|  2 |      4 |      4 |
+----+--------+--------+
|  3 |      6 |      6 |
+----+--------+--------+
|  4 |      8 |      8 |
+----+--------+--------+
>>> left_df.merge(right_df, on = "col1")
+----+--------+----------+----------+
|    |   col1 |   col2_x |   col2_y |
+====+========+==========+==========+
|  0 |      0 |        0 |        0 |
+----+--------+----------+----------+
|  1 |      2 |       -2 |        2 |
+----+--------+----------+----------+
|  2 |      4 |       -4 |        4 |
+----+--------+----------+----------+
>>> left_df.merge(right_df, on = "col1", how = "left")
+----+--------+----------+----------+
|    |   col1 |   col2_y |   col2_x |
+====+========+==========+==========+
|  0 |      0 |        0 |        0 |
+----+--------+----------+----------+
|  1 |      1 |      nan |       -1 |
+----+--------+----------+----------+
|  2 |      2 |        2 |       -2 |
+----+--------+----------+----------+
|  3 |      3 |      nan |       -3 |
+----+--------+----------+----------+
|  4 |      4 |        4 |       -4 |
+----+--------+----------+----------+
>>> left_df.merge(right_df, on = "col1", how = "right")
+----+--------+----------+----------+
|    |   col1 |   col2_x |   col2_y |
+====+========+==========+==========+
|  0 |      0 |        0 |        0 |
+----+--------+----------+----------+
|  1 |      2 |       -2 |        2 |
+----+--------+----------+----------+
|  2 |      4 |       -4 |        4 |
+----+--------+----------+----------+
|  3 |      6 |      nan |        6 |
+----+--------+----------+----------+
|  4 |      8 |      nan |        8 |
+----+--------+----------+----------+
>>> left_df.merge(right_df, on = "col1", how = "outer")
+----+--------+----------+----------+
|    |   col1 |   col2_y |   col2_x |
+====+========+==========+==========+
|  0 |      0 |        0 |        0 |
+----+--------+----------+----------+
|  1 |      1 |      nan |       -1 |
+----+--------+----------+----------+
|  2 |      2 |        2 |       -2 |
+----+--------+----------+----------+
|  3 |      3 |      nan |       -3 |
+----+--------+----------+----------+
|  4 |      4 |        4 |       -4 |
+----+--------+----------+----------+
|  5 |      6 |        6 |      nan |
+----+--------+----------+----------+
|  6 |      8 |        8 |      nan |
+----+--------+----------+----------+
- notna() → DataFrame
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. numpy.NaN values get mapped to False values.
- Returns:
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
- Return type:
arkouda.dataframe.DataFrame
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({"A": [np.nan, 2, 2, 3], "B": [3, np.nan, 5, 6],
...                    "C": [1, np.nan, 2, np.nan], "D": ["a", "b", "c", "d"]})
>>> display(df)
+----+-----+-----+-----+-----+
|    |   A |   B |   C | D   |
+====+=====+=====+=====+=====+
|  0 | nan |   3 |   1 | a   |
+----+-----+-----+-----+-----+
|  1 |   2 | nan | nan | b   |
+----+-----+-----+-----+-----+
|  2 |   2 |   5 |   2 | c   |
+----+-----+-----+-----+-----+
|  3 |   3 |   6 | nan | d   |
+----+-----+-----+-----+-----+
>>> df.notna()
       A      B      C     D
0  False   True   True  True
1   True  False  False  True
2   True   True   True  True
3   True   True  False  True
(4 rows x 4 columns)
- objType(*args, **kwargs)
str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str
Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to 'strict'.
- read_csv(filename: str, col_delim: str = ',')
Read the columns of a CSV file into an Arkouda DataFrame. If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as Strings objects.
- Parameters:
filename (str) – Filename to read data from.
col_delim (str, default=",") – The delimiter for columns within the data.
- Returns:
Arkouda DataFrame containing the columns from the CSV file.
- Return type:
arkouda.dataframe.DataFrame
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server.
See also
to_csv
Notes
CSV format is not currently supported by load/load_all operations.
The column delimiter is expected to be the same for column names and data.
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline ("\n") at this time.
Unlike other file formats, CSV files store Strings in their UTF-8 format instead of storing bytes as uint(8).
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'csv_output', 'my_data')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df.to_csv(my_path)
>>> df2 = DataFrame.read_csv(my_path + "_LOCALE0000")
>>> display(df2)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   1 |   3 |
+----+-----+-----+
|  1 |   2 |   4 |
+----+-----+-----+
- register(user_defined_name: str) → DataFrame
Register this DataFrame object and underlying components with the Arkouda server.
- Parameters:
user_defined_name (str) – User defined name the DataFrame is to be registered under. This will be the root name for underlying components.
- Returns:
The same DataFrame which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluid programming style. Please note you cannot register two different DataFrames with the same name.
- Return type:
arkouda.dataframe.DataFrame
- Raises:
TypeError – Raised if user_defined_name is not a str.
RegistrationError – If the server was unable to register the DataFrame with the user_defined_name.
See also
unregister
,attach
,unregister_dataframe_by_name
,is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
Any changes made to a DataFrame object after registering with the server may not be reflected in attached copies.
Example
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
- rename(mapper: Callable | Dict | None = None, index: Callable | Dict | None = None, column: Callable | Dict | None = None, axis: str | int = 0, inplace: bool = False) → DataFrame | None
Rename indexes or columns according to a mapping.
- Parameters:
mapper (callable or dict-like, Optional) – Function or dictionary mapping existing values to new values. Nonexistent names will not raise an error. Uses the value of axis to determine whether to rename columns or indexes.
column (callable or dict-like, Optional) – Function or dictionary mapping existing column names to new column names. Nonexistent names will not raise an error. When this is set, axis is ignored.
index (callable or dict-like, Optional) – Function or dictionary mapping existing index names to new index names. Nonexistent names will not raise an error. When this is set, axis is ignored.
axis (int or str, default=0) – Indicates which axis to perform the rename on. 0/"index" - indexes, 1/"column" - columns.
inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False; None when inplace=True.
- Return type:
arkouda.dataframe.DataFrame or None
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> display(df)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   1 |   4 |
+----+-----+-----+
|  1 |   2 |   5 |
+----+-----+-----+
|  2 |   3 |   6 |
+----+-----+-----+
Rename columns using a mapping:
>>> df.rename(column={'A': 'a', 'B': 'c'})
+----+-----+-----+
|    |   a |   c |
+====+=====+=====+
|  0 |   1 |   4 |
+----+-----+-----+
|  1 |   2 |   5 |
+----+-----+-----+
|  2 |   3 |   6 |
+----+-----+-----+
Rename indexes using a mapping:
>>> df.rename(index={0: 99, 2: 11})
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
| 99 |   1 |   4 |
+----+-----+-----+
|  1 |   2 |   5 |
+----+-----+-----+
| 11 |   3 |   6 |
+----+-----+-----+
Rename using an axis style parameter:
>>> df.rename(str.lower, axis='column')
+----+-----+-----+
|    |   a |   b |
+====+=====+=====+
|  0 |   1 |   4 |
+----+-----+-----+
|  1 |   2 |   5 |
+----+-----+-----+
|  2 |   3 |   6 |
+----+-----+-----+
- reset_index(size: int | None = None, inplace: bool = False) → None | DataFrame
Set the index to an integer range.
Useful if this dataframe is the result of a slice operation from another dataframe, or if you have permuted the rows and no longer need to keep that ordering on the rows.
- Parameters:
size (int, optional) – If size is passed, do not attempt to determine size based on existing column sizes. Assume caller handles consistency correctly.
inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False; None when inplace=True.
- Return type:
arkouda.dataframe.DataFrame or None
Note
Pandas adds a column ‘index’ to indicate the original index. Arkouda does not currently support this behavior.
Example
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> display(df)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   1 |   4 |
+----+-----+-----+
|  1 |   2 |   5 |
+----+-----+-----+
|  2 |   3 |   6 |
+----+-----+-----+
>>> perm_df = df[ak.array([0, 2, 1])]
>>> display(perm_df)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   1 |   4 |
+----+-----+-----+
|  1 |   3 |   6 |
+----+-----+-----+
|  2 |   2 |   5 |
+----+-----+-----+
>>> perm_df.reset_index()
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   1 |   4 |
+----+-----+-----+
|  1 |   3 |   6 |
+----+-----+-----+
|  2 |   2 |   5 |
+----+-----+-----+
- sample(n=5)
Return a random sample of n rows.
- Parameters:
n (int, default=5) – Number of rows to return.
- Returns:
The sampled n rows of the DataFrame.
- Return type:
arkouda.dataframe.DataFrame
Example
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> display(df)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   0 |   0 |
+----+-----+-----+
|  1 |   1 |  -1 |
+----+-----+-----+
|  2 |   2 |  -2 |
+----+-----+-----+
|  3 |   3 |  -3 |
+----+-----+-----+
|  4 |   4 |  -4 |
+----+-----+-----+
Random output of size 3:
>>> df.sample(n=3)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   0 |   0 |
+----+-----+-----+
|  1 |   1 |  -1 |
+----+-----+-----+
|  2 |   4 |  -4 |
+----+-----+-----+
- save(path, index=False, columns=None, file_format='HDF5', file_type='distribute', compression: str | None = None)
DEPRECATED. Save DataFrame to disk, preserving column names.
- Parameters:
path (str) – File path to save data.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (list, default=None) – List of columns to include in the file. If None, writes out all columns.
file_format (str, default='HDF5') – 'HDF5' or 'Parquet'. Defaults to 'HDF5'.
file_type (str, default='distribute') – "single" or "distribute". If "single", will write a single file on locale 0.
compression (str, optional) – (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4") Compression type. Only used for Parquet.
Notes
This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.
See also
load
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf5_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> df.save(my_path + '/my_data', file_type="single")
>>> df.load(my_path + '/my_data')
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   0 |   0 |
+----+-----+-----+
|  1 |   1 |  -1 |
+----+-----+-----+
|  2 |   2 |  -2 |
+----+-----+-----+
|  3 |   3 |  -3 |
+----+-----+-----+
|  4 |   4 |  -4 |
+----+-----+-----+
- property shape
The shape of the dataframe.
- Returns:
Tuple of array dimensions.
- Return type:
tuple of int
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      4 |
+----+--------+--------+
|  1 |      2 |      5 |
+----+--------+--------+
|  2 |      3 |      6 |
+----+--------+--------+
>>> df.shape
(3, 2)
- property size
Returns the number of elements in the dataframe (rows times columns), as the example below shows.
- Returns:
The number of elements in the dataframe.
- Return type:
int
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      1 |      4 |
+----+--------+--------+
|  1 |      2 |      5 |
+----+--------+--------+
|  2 |      3 |      6 |
+----+--------+--------+
>>> df.size
6
- sort_index(ascending=True)
Sort the DataFrame by indexed columns.
Note: Fails on sort order of arkouda.strings.Strings columns when multiple columns being sorted.
- Parameters:
ascending (bool, default = True) – Sort values in ascending (default) or descending order.
Example
>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]},
...                   index=ak.Index(ak.array([2, 0, 1]), name="idx"))
>>> display(df)
+-------+--------+--------+
|   idx |   col1 |   col2 |
+=======+========+========+
|     2 |    1.1 |      6 |
+-------+--------+--------+
|     0 |    3.1 |      5 |
+-------+--------+--------+
|     1 |    2.1 |      4 |
+-------+--------+--------+
>>> df.sort_index()
+-------+--------+--------+
|   idx |   col1 |   col2 |
+=======+========+========+
|     0 |    3.1 |      5 |
+-------+--------+--------+
|     1 |    2.1 |      4 |
+-------+--------+--------+
|     2 |    1.1 |      6 |
+-------+--------+--------+
- sort_values(by=None, ascending=True)
Sort the DataFrame by one or more columns.
If no column is specified, all columns are used.
Note: Fails on order of arkouda.strings.Strings columns when multiple columns being sorted.
- Parameters:
by (str or list/tuple of str, default = None) – The name(s) of the column(s) to sort by.
ascending (bool, default = True) – Sort values in ascending (default) or descending order.
See also
argsort, sort_index
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3': [5, 6, 7]})
>>> display(df)
+----+--------+--------+--------+
|    |   col1 |   col2 |   col3 |
+====+========+========+========+
|  0 |      2 |      3 |      5 |
+----+--------+--------+--------+
|  1 |      2 |      4 |      6 |
+----+--------+--------+--------+
|  2 |      1 |      3 |      7 |
+----+--------+--------+--------+
>>> df.sort_values()
+----+--------+--------+--------+
|    |   col1 |   col2 |   col3 |
+====+========+========+========+
|  0 |      1 |      3 |      7 |
+----+--------+--------+--------+
|  1 |      2 |      3 |      5 |
+----+--------+--------+--------+
|  2 |      2 |      4 |      6 |
+----+--------+--------+--------+
>>> df.sort_values("col3")
+----+--------+--------+--------+
|    |   col1 |   col2 |   col3 |
+====+========+========+========+
|  0 |      2 |      3 |      5 |
+----+--------+--------+--------+
|  1 |      2 |      4 |      6 |
+----+--------+--------+--------+
|  2 |      1 |      3 |      7 |
+----+--------+--------+--------+
- tail(n=5)
Return the last n rows.
This function returns the last n rows for the dataframe. It is useful for quickly testing if your object has the right type of data in it.
- Parameters:
n (int, default=5) – Number of rows to select.
- Returns:
The last n rows of the DataFrame.
- Return type:
arkouda.dataframe.DataFrame
See also
arkouda.dataframe.head
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> display(df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      0 |      0 |
+----+--------+--------+
|  1 |      1 |     -1 |
+----+--------+--------+
|  2 |      2 |     -2 |
+----+--------+--------+
|  3 |      3 |     -3 |
+----+--------+--------+
|  4 |      4 |     -4 |
+----+--------+--------+
|  5 |      5 |     -5 |
+----+--------+--------+
|  6 |      6 |     -6 |
+----+--------+--------+
|  7 |      7 |     -7 |
+----+--------+--------+
|  8 |      8 |     -8 |
+----+--------+--------+
|  9 |      9 |     -9 |
+----+--------+--------+
>>> df.tail()
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      5 |     -5 |
+----+--------+--------+
|  1 |      6 |     -6 |
+----+--------+--------+
|  2 |      7 |     -7 |
+----+--------+--------+
|  3 |      8 |     -8 |
+----+--------+--------+
|  4 |      9 |     -9 |
+----+--------+--------+
>>> df.tail(n=2)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      8 |     -8 |
+----+--------+--------+
|  1 |      9 |     -9 |
+----+--------+--------+
- to_csv(path: str, index: bool = False, columns: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)
Writes DataFrame to CSV file(s). File will contain a column for each column in the DataFrame. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
- Parameters:
path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
index (bool, default=False) – If True, the index of the DataFrame will be written to the file as a column.
columns (list of str (Optional)) – Column names to assign when writing data.
col_delim (str, default=",") – Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool, default=False) – If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
None
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server.
Notes
CSV format is not currently supported by load/load_all operations.
The column delimiter is expected to be the same for column names and data.
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline ("\n") at this time.
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'csv_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df.to_csv(my_path + "/my_data")
>>> df2 = DataFrame.read_csv(my_path + "/my_data" + "_LOCALE0000")
>>> display(df2)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   1 |   3 |
+----+-----+-----+
|  1 |   2 |   4 |
+----+-----+-----+
- to_hdf(path, index=False, columns=None, file_type='distribute')
Save DataFrame to disk as hdf5, preserving column names.
- Parameters:
path (str) – File path to save data.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (List, default = None) – List of columns to include in the file. If None, writes out all columns.
file_type (str (single | distribute), default=distribute) – Whether to save to a single file or distribute across Locales.
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray.
Notes
This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.
See also
load
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df.to_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   1 |   3 |
+----+-----+-----+
|  1 |   2 |   4 |
+----+-----+-----+
- to_markdown(mode='wt', index=True, tablefmt='grid', storage_options=None, **kwargs)
Print DataFrame in Markdown-friendly format.
- Parameters:
mode (str, optional) – Mode in which file is opened, “wt” by default.
index (bool, optional, default True) – Add index (row) labels.
tablefmt (str = "grid") – Table format to call from tabulate: https://pypi.org/project/tabulate/
storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set of allowed keys and values.
**kwargs – These parameters will be passed to tabulate.
Note
This function should only be called on small DataFrames as it calls pandas.DataFrame.to_markdown: https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.to_markdown.html
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]})
>>> print(df.to_markdown())
+----+------------+------------+
|    | animal_1   | animal_2   |
+====+============+============+
|  0 | elk        | dog        |
+----+------------+------------+
|  1 | pig        | quetzal    |
+----+------------+------------+
Suppress the index:
>>> print(df.to_markdown(index = False))
+------------+------------+
| animal_1   | animal_2   |
+============+============+
| elk        | dog        |
+------------+------------+
| pig        | quetzal    |
+------------+------------+
- to_pandas(datalimit=1073741824, retain_index=False)
Send this DataFrame to a pandas DataFrame.
- Parameters:
datalimit (int, default=arkouda.client.maxTransferBytes) – The maximum size, in megabytes, to transfer. The requested DataFrame will be converted to a pandas DataFrame only if the estimated size of the DataFrame does not exceed this value.
retain_index (bool, default=False) – Normally, to_pandas() creates a new range index object. If you want to keep the index column, set this to True.
- Returns:
The result of converting this DataFrame to a pandas DataFrame.
- Return type:
pandas.DataFrame
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> ak_df = ak.DataFrame({"A": ak.arange(2), "B": -1 * ak.arange(2)})
>>> type(ak_df)
arkouda.dataframe.DataFrame
>>> display(ak_df)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   0 |   0 |
+----+-----+-----+
|  1 |   1 |  -1 |
+----+-----+-----+
>>> import pandas as pd
>>> pd_df = ak_df.to_pandas()
>>> type(pd_df)
pandas.core.frame.DataFrame
>>> display(pd_df)
   A  B
0  0  0
1  1 -1
- to_parquet(path, index=False, columns=None, compression: str | None = None, convert_categoricals: bool = False)
Save DataFrame to disk as parquet, preserving column names.
- Parameters:
path (str) – File path to save data.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (list) – List of columns to include in the file. If None, writes out all columns.
compression (str (Optional), default=None) – Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4
convert_categoricals (bool, default=False) – Parquet requires all columns to be the same size and Categoricals don’t satisfy that requirement. If set, write the equivalent Strings in place of any Categorical columns.
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'parquet_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df.to_parquet(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
+----+-----+-----+
|    |   B |   A |
+====+=====+=====+
|  0 |   3 |   1 |
+----+-----+-----+
|  1 |   4 |   2 |
+----+-----+-----+
- transfer(hostname, port)
Sends a DataFrame to a different Arkouda server.
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the DataFrame is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each in succession, so ports in the range {port..(port+numLocales)} will be used (e.g., running an Arkouda server of 4 nodes, with port 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Returns:
A message indicating a complete transfer.
- Return type:
str
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
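No example accompanies transfer above; the following is a hypothetical sketch. The destination hostname and port are placeholders, and a second Arkouda server must be listening on the receiving side (per the note above about ak.receive_array()):
>>> df = ak.DataFrame({'col1': ak.arange(3), 'col2': -1 * ak.arange(3)})
>>> msg = df.transfer('destination-host', 1234)  # placeholder host/port; returns a completion message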
- unregister()
Unregister this DataFrame object in the arkouda server which was previously registered using register() and/or attached to using attach().
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister.
See also
register
,attach
,unregister_dataframe_by_name
,is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
Example
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
- unregister_dataframe_by_name(user_defined_name: str) → str
Function to unregister DataFrame object by name which was registered with the arkouda server via register().
- Parameters:
user_defined_name (str) – Name under which the DataFrame object was registered.
- Raises:
TypeError – If user_defined_name is not a string.
RegistrationError – If there is an issue attempting to unregister any underlying components.
See also
register, unregister, attach, is_registered
Example
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister_dataframe_by_name("my_table_name")
>>> df.is_registered()
False
- update_hdf(prefix_path: str, index=False, columns=None, repack: bool = True)
Overwrite the dataset with the name provided with this dataframe. If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (List, default=None) – List of columns to include in the file. If None, writes out all columns.
repack (bool, default=True) – HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Returns:
Success message if successful.
- Return type:
str
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray.
Notes
If the file does not contain a File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added.
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df.to_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   1 |   3 |
+----+-----+-----+
|  1 |   2 |   4 |
+----+-----+-----+
>>> df2 = ak.DataFrame({"A": [5, 6], "B": [7, 8]})
>>> df2.update_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   5 |   7 |
+----+-----+-----+
|  1 |   6 |   8 |
+----+-----+-----+
- class arkouda.dataframe.DataFrameGroupBy
A DataFrame that has been grouped by a subset of columns.
- Parameters:
gb_key_names (str or list(str), default=None) – The column name(s) associated with the aggregated columns.
as_index (bool, default=True) – If True, interpret aggregated column as index (only implemented for single dimensional aggregates). Otherwise, treat aggregated column as a dataframe column.
- gb
GroupBy object, where the aggregation keys are values of column(s) of a dataframe, usually in preparation for aggregating with respect to the other columns.
- df
The dataframe containing the original data.
- gb_key_names
The column name(s) associated with the aggregated columns.
- Type:
str or list(str)
- as_index
If True, the grouped values of the aggregation keys will be treated as an index.
- Type:
bool, default=True
- all(colnames=None)
Aggregate the operation, with the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- any(colnames=None)
Aggregate the operation, with the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- argmax(colnames=None)
Aggregate the operation, with the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- argmin(colnames=None)
Aggregate the operation, with the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
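None of the aggregation helpers above carries its own example here; the following is a small sketch of the calling pattern they share (hedged: it assumes the default as_index=True, under which the distinct values of the grouped column become the index of the result):
>>> df = ak.DataFrame({"A": [1, 2, 2], "B": [3, 4, 5]})
>>> counts = df.groupby("A").count("B")  # one count per distinct value of A: 1 -> 1, 2 -> 2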
- broadcast(x, permute=True)
Fill each group's segment with a constant value.
- Parameters:
x (pdarray or Series) – The values to put in each group's segment.
permute (bool, default=True) – If True, permute the result to match the ordering of the original dataframe. If False, the result is ordered by group.
- Returns:
A Series with the Index of the original frame and the values of the broadcast.
- Return type:
arkouda.series.Series
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.dataframe import DataFrameGroupBy
>>> df = ak.DataFrame({"A": [1, 2, 2, 3], "B": [3, 4, 5, 6]})
>>> display(df)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   1 |   3 |
+----+-----+-----+
|  1 |   2 |   4 |
+----+-----+-----+
|  2 |   2 |   5 |
+----+-----+-----+
|  3 |   3 |   6 |
+----+-----+-----+
>>> gb = df.groupby("A")
>>> x = ak.array([10, 11, 12])
>>> s = DataFrameGroupBy.broadcast(gb, x)
>>> df["C"] = s.values
>>> display(df)
+----+-----+-----+-----+
|    |   A |   B |   C |
+====+=====+=====+=====+
|  0 |   1 |   3 |  10 |
+----+-----+-----+-----+
|  1 |   2 |   4 |  11 |
+----+-----+-----+-----+
|  2 |   2 |   5 |  11 |
+----+-----+-----+-----+
|  3 |   3 |   6 |  12 |
+----+-----+-----+-----+
- count(colnames=None)
Aggregate the operation, with the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- diff(colname)
Create a difference aggregate for the given column.
For each group, the difference between successive values is calculated. Aggregate operations (mean, min, max, std, var) can be done on the results.
- Parameters:
colname (str) – Name of the column to compute the difference on.
- Returns:
Object containing the differences, which can be aggregated.
- Return type:
arkouda.dataframe.DiffAggregate
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A": [1, 2, 2, 2, 3, 3], "B": [3, 9, 11, 27, 86, 100]})
>>> display(df)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   1 |   3 |
+----+-----+-----+
|  1 |   2 |   9 |
+----+-----+-----+
|  2 |   2 |  11 |
+----+-----+-----+
|  3 |   2 |  27 |
+----+-----+-----+
|  4 |   3 |  86 |
+----+-----+-----+
|  5 |   3 | 100 |
+----+-----+-----+
>>> gb = df.groupby("A")
>>> gb.diff("B").values
array([nan nan 2.00000000000000000 16.00000000000000000 nan 14.00000000000000000])
- first(colnames=None)
Aggregate the operation, with the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- head(n: int = 5, sort_index: bool = True) → DataFrame
Return the first n rows from each group.
- Parameters:
n (int, optional, default = 5) – Maximum number of rows to return for each group. If the number of rows in a group is less than n, all the values from that group will be returned.
sort_index (bool, default = True) – If true, return the DataFrame with indices sorted.
- Return type:
arkouda.dataframe.DataFrame
Examples
>>> import arkouda as ak
>>> from arkouda import *
>>> df = ak.DataFrame({"a": ak.arange(10) % 3, "b": ak.arange(10)})
>>> display(df)
+----+-----+-----+
|    |   a |   b |
+====+=====+=====+
|  0 |   0 |   0 |
+----+-----+-----+
|  1 |   1 |   1 |
+----+-----+-----+
|  2 |   2 |   2 |
+----+-----+-----+
|  3 |   0 |   3 |
+----+-----+-----+
|  4 |   1 |   4 |
+----+-----+-----+
|  5 |   2 |   5 |
+----+-----+-----+
|  6 |   0 |   6 |
+----+-----+-----+
|  7 |   1 |   7 |
+----+-----+-----+
|  8 |   2 |   8 |
+----+-----+-----+
|  9 |   0 |   9 |
+----+-----+-----+
>>> df.groupby("a").head(2)
+----+-----+-----+
|    |   a |   b |
+====+=====+=====+
|  0 |   0 |   0 |
+----+-----+-----+
|  1 |   0 |   3 |
+----+-----+-----+
|  2 |   1 |   1 |
+----+-----+-----+
|  3 |   1 |   4 |
+----+-----+-----+
|  4 |   2 |   2 |
+----+-----+-----+
|  5 |   2 |   5 |
+----+-----+-----+
- max(colnames=None)¶
Aggregate the operation, with the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- mean(colnames=None)¶
Compute the "mean" aggregation within each group, using the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- median(colnames=None)¶
Compute the "median" aggregation within each group, using the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- min(colnames=None)¶
Compute the "min" aggregation within each group, using the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- mode(colnames=None)¶
Compute the "mode" aggregation within each group, using the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- nunique(colnames=None)¶
Compute the "nunique" aggregation within each group, using the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- prod(colnames=None)¶
Compute the "prod" aggregation within each group, using the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
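The numeric aggregations above follow the same calling pattern; a minimal sketch (toy frame assumed, output omitted since the result's display depends on the GroupBy configuration):
>>> import arkouda as ak
>>> gb = ak.DataFrame({"A": [1, 2, 2, 3], "B": [3, 4, 5, 6]}).groupby("A")
>>> gb.mean("B")     # per-group mean of B
>>> gb.nunique("B")  # per-group number of distinct values of B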
- sample(n=None, frac=None, replace=False, weights=None, random_state=None)[source]¶
Return a random sample from each group. You can specify either the number of elements (n) or the fraction of elements (frac) to be sampled. random_state can be used for reproducibility.
- Parameters:
n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.
frac (float, optional) – Fraction of items to return. Cannot be used with n.
replace (bool, default False) – Allow or disallow sampling of the same row more than once.
weights (pdarray, optional) – Default None results in equal probability weighting. If passed a pdarray, then values must have the same length as the underlying DataFrame and will be used as sampling probabilities after normalization within each group. Weights must be non-negative with at least one positive element within each group.
random_state (int or ak.random.Generator, optional) – If int, seed for random number generator. If ak.random.Generator, use as given.
- Returns:
A new DataFrame containing items randomly sampled from each group sorted according to the grouped columns.
- Return type:
arkouda.dataframe.DataFrame
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A": [3, 1, 2, 1, 2, 3], "B": [3, 4, 5, 6, 7, 8]})
>>> display(df)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   3 |   3 |
+----+-----+-----+
|  1 |   1 |   4 |
+----+-----+-----+
|  2 |   2 |   5 |
+----+-----+-----+
|  3 |   1 |   6 |
+----+-----+-----+
|  4 |   2 |   7 |
+----+-----+-----+
|  5 |   3 |   8 |
+----+-----+-----+
>>> df.groupby("A").sample(random_state=6)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  3 |   1 |   6 |
+----+-----+-----+
|  4 |   2 |   7 |
+----+-----+-----+
|  5 |   3 |   8 |
+----+-----+-----+
>>> df.groupby("A").sample(frac=0.5, random_state=3, weights=ak.array([1, 1, 1, 0, 0, 0]))
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  1 |   1 |   4 |
+----+-----+-----+
|  2 |   2 |   5 |
+----+-----+-----+
|  0 |   3 |   3 |
+----+-----+-----+
>>> df.groupby("A").sample(n=3, replace=True, random_state=ak.random.default_rng(7))
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  1 |   1 |   4 |
+----+-----+-----+
|  3 |   1 |   6 |
+----+-----+-----+
|  1 |   1 |   4 |
+----+-----+-----+
|  4 |   2 |   7 |
+----+-----+-----+
|  4 |   2 |   7 |
+----+-----+-----+
|  4 |   2 |   7 |
+----+-----+-----+
|  0 |   3 |   3 |
+----+-----+-----+
|  5 |   3 |   8 |
+----+-----+-----+
|  5 |   3 |   8 |
+----+-----+-----+
- size(as_series=None, sort_index=True)[source]¶
Compute the size of each group, i.e. the total number of rows per key value, including NaN values.
- Parameters:
as_series (bool, default=None) – Indicates whether to return an arkouda.dataframe.DataFrame (if as_series=False) or an arkouda.series.Series (if as_series=True).
sort_index (bool, default=True) – If True, results will be returned with index values sorted in ascending order.
- Return type:
arkouda.dataframe.DataFrame or arkouda.series.Series
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A": [1, 2, 2, 3], "B": [3, 4, 5, 6]})
>>> display(df)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   1 |   3 |
+----+-----+-----+
|  1 |   2 |   4 |
+----+-----+-----+
|  2 |   2 |   5 |
+----+-----+-----+
|  3 |   3 |   6 |
+----+-----+-----+
>>> df.groupby("A").size(as_series=False)
+----+--------+
|    |   size |
+====+========+
|  0 |      1 |
+----+--------+
|  1 |      2 |
+----+--------+
|  2 |      1 |
+----+--------+
- std(colnames=None)¶
Compute the "std" aggregation within each group, using the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- sum(colnames=None)¶
Compute the "sum" aggregation within each group, using the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
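Likewise for std and sum; a minimal sketch (same toy frame as above, output omitted):
>>> import arkouda as ak
>>> gb = ak.DataFrame({"A": [1, 2, 2, 3], "B": [3, 4, 5, 6]}).groupby("A")
>>> gb.sum("B")  # per-group sum of B
>>> gb.std("B")  # per-group standard deviation of B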
- tail(n: int = 5, sort_index: bool = True) → DataFrame[source]¶
Return the last n rows from each group.
- Parameters:
n (int, optional, default = 5) – Maximum number of rows to return for each group. If the number of rows in a group is less than n, all the rows from that group will be returned.
sort_index (bool, default = True) – If True, return the DataFrame with indices sorted.
- Return type:
arkouda.dataframe.DataFrame
Examples
>>> import arkouda as ak
>>> from arkouda import *
>>> df = ak.DataFrame({"a": ak.arange(10) % 3, "b": ak.arange(10)})
>>> display(df)
+----+-----+-----+
|    |   a |   b |
+====+=====+=====+
|  0 |   0 |   0 |
+----+-----+-----+
|  1 |   1 |   1 |
+----+-----+-----+
|  2 |   2 |   2 |
+----+-----+-----+
|  3 |   0 |   3 |
+----+-----+-----+
|  4 |   1 |   4 |
+----+-----+-----+
|  5 |   2 |   5 |
+----+-----+-----+
|  6 |   0 |   6 |
+----+-----+-----+
|  7 |   1 |   7 |
+----+-----+-----+
|  8 |   2 |   8 |
+----+-----+-----+
|  9 |   0 |   9 |
+----+-----+-----+
>>> df.groupby("a").tail(2)
+----+-----+-----+
|    |   a |   b |
+====+=====+=====+
|  0 |   0 |   6 |
+----+-----+-----+
|  1 |   0 |   9 |
+----+-----+-----+
|  2 |   1 |   4 |
+----+-----+-----+
|  3 |   1 |   7 |
+----+-----+-----+
|  4 |   2 |   5 |
+----+-----+-----+
|  5 |   2 |   8 |
+----+-----+-----+
- unique(colnames=None)¶
Compute the "unique" aggregation within each group, using the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- var(colnames=None)¶
Compute the "var" aggregation within each group, using the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- xor(colnames=None)¶
Compute the "xor" aggregation within each group, using the grouped column(s) values as keys.
- Parameters:
colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.
- Return type:
- class arkouda.dataframe.DiffAggregate[source]¶
A column in a GroupBy that has been differenced. Aggregation operations can be done on the result.
- gb¶
GroupBy object, where the aggregation keys are values of column(s) of a dataframe.
- values¶
A column containing the group-wise differences.
- Type:
arkouda.series.Series
- all()¶
- any()¶
- argmax()¶
- argmin()¶
- count()¶
- first()¶
- max()¶
- mean()¶
- median()¶
- min()¶
- mode()¶
- nunique()¶
- prod()¶
- std()¶
- sum()¶
- unique()¶
- var()¶
- xor()¶
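A minimal sketch of aggregating a DiffAggregate (reusing the frame from the diff example above; output omitted since the display format may vary):
>>> import arkouda as ak
>>> df = ak.DataFrame({"A": [1, 2, 2, 2, 3, 3], "B": [3, 9, 11, 27, 86, 100]})
>>> da = df.groupby("A").diff("B")  # DiffAggregate over column B
>>> da.max()   # per-group maximum of the successive differences
>>> da.mean()  # per-group mean of the successive differences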
- arkouda.dataframe.intersect(a, b, positions=True, unique=False)[source]¶
Find the intersection of two arkouda arrays.
This function can be especially useful when positions=True so that the caller gets the indices of values present in both arrays.
- Parameters:
a (pdarray) – The first array to intersect.
b (pdarray) – The second array to intersect.
positions (bool, default=True) – Return tuple of boolean pdarrays that indicate positions in a and b of the intersection values.
unique (bool, default=False) – If the number of distinct values in a (and b) is equal to the size of a (and b), there is a more efficient method to compute the intersection.
- Returns:
The indices of a and b where any element occurs at least once in both arrays.
- Return type:
(arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray) or arkouda.pdarrayclass.pdarray
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.dataframe import intersect
>>> a = ak.arange(10)
>>> print(a)
[0 1 2 3 4 5 6 7 8 9]
>>> b = 2 * ak.arange(10)
>>> print(b)
[0 2 4 6 8 10 12 14 16 18]
>>> intersect(a, b, positions=True)
(array([True False True False True False True False True False]), array([True True True True True False False False False False]))
>>> intersect(a, b, positions=False)
array([0 2 4 6 8])
- arkouda.dataframe.intx(a, b)[source]¶
Find all the rows that are in both dataframes. Columns should be in identical order.
Note: this does not work for columns of floating-point values; Strings and pdarrays of int64 type are supported, and Categorical columns should work as well.
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.dataframe import intx
>>> a = ak.DataFrame({'a': ak.arange(5), 'b': 2 * ak.arange(5)})
>>> display(a)
+----+-----+-----+
|    |   a |   b |
+====+=====+=====+
|  0 |   0 |   0 |
+----+-----+-----+
|  1 |   1 |   2 |
+----+-----+-----+
|  2 |   2 |   4 |
+----+-----+-----+
|  3 |   3 |   6 |
+----+-----+-----+
|  4 |   4 |   8 |
+----+-----+-----+
>>> b = ak.DataFrame({'a': ak.arange(5), 'b': ak.array([0, 3, 4, 7, 8])})
>>> display(b)
+----+-----+-----+
|    |   a |   b |
+====+=====+=====+
|  0 |   0 |   0 |
+----+-----+-----+
|  1 |   1 |   3 |
+----+-----+-----+
|  2 |   2 |   4 |
+----+-----+-----+
|  3 |   3 |   7 |
+----+-----+-----+
|  4 |   4 |   8 |
+----+-----+-----+
>>> intersect_df = a[intx(a, b)]
>>> display(intersect_df)
+----+-----+-----+
|    |   a |   b |
+====+=====+=====+
|  0 |   0 |   0 |
+----+-----+-----+
|  1 |   2 |   4 |
+----+-----+-----+
|  2 |   4 |   8 |
+----+-----+-----+
- arkouda.dataframe.invert_permutation(perm)[source]¶
Find the inverse of a permutation array.
- Parameters:
perm (pdarray) – The permutation array.
- Returns:
The inverse of the permutation array.
- Return type:
pdarray
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.index import Index
>>> from arkouda.dataframe import invert_permutation
>>> i = Index(ak.array([1, 2, 0, 5, 4]))
>>> perm = i.argsort()
>>> print(perm)
[2 0 1 4 3]
>>> invert_permutation(perm)
array([1 2 0 4 3])
- arkouda.dataframe.merge(left: DataFrame, right: DataFrame, on: str | List[str] | None = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) → DataFrame[source]¶
Merge Arkouda DataFrames with a database-style join. The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters).
Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
- Parameters:
left (DataFrame) – The Left DataFrame to be joined.
right (DataFrame) – The Right DataFrame to be joined.
on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames.
how (str, default = "inner") – The merge condition. Must be one of “inner”, “left”, “right”, or “outer”.
left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”.
right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”.
convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = “inner”.
sort (bool = True) – If True, DataFrame is returned sorted by “on”. Otherwise, the DataFrame is not sorted.
- Returns:
Joined Arkouda DataFrame.
- Return type:
arkouda.dataframe.DataFrame
Note
Multiple column joins are only supported for integer columns.
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda import merge
>>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)})
>>> display(left_df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      0 |      0 |
+----+--------+--------+
|  1 |      1 |     -1 |
+----+--------+--------+
|  2 |      2 |     -2 |
+----+--------+--------+
|  3 |      3 |     -3 |
+----+--------+--------+
|  4 |      4 |     -4 |
+----+--------+--------+
>>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)})
>>> display(right_df)
+----+--------+--------+
|    |   col1 |   col2 |
+====+========+========+
|  0 |      0 |      0 |
+----+--------+--------+
|  1 |      2 |      2 |
+----+--------+--------+
|  2 |      4 |      4 |
+----+--------+--------+
|  3 |      6 |      6 |
+----+--------+--------+
|  4 |      8 |      8 |
+----+--------+--------+
>>> merge(left_df, right_df, on="col1")
+----+--------+----------+----------+
|    |   col1 |   col2_x |   col2_y |
+====+========+==========+==========+
|  0 |      0 |        0 |        0 |
+----+--------+----------+----------+
|  1 |      2 |       -2 |        2 |
+----+--------+----------+----------+
|  2 |      4 |       -4 |        4 |
+----+--------+----------+----------+
>>> merge(left_df, right_df, on="col1", how="left")
+----+--------+----------+----------+
|    |   col1 |   col2_y |   col2_x |
+====+========+==========+==========+
|  0 |      0 |        0 |        0 |
+----+--------+----------+----------+
|  1 |      1 |      nan |       -1 |
+----+--------+----------+----------+
|  2 |      2 |        2 |       -2 |
+----+--------+----------+----------+
|  3 |      3 |      nan |       -3 |
+----+--------+----------+----------+
|  4 |      4 |        4 |       -4 |
+----+--------+----------+----------+
>>> merge(left_df, right_df, on="col1", how="right")
+----+--------+----------+----------+
|    |   col1 |   col2_x |   col2_y |
+====+========+==========+==========+
|  0 |      0 |        0 |        0 |
+----+--------+----------+----------+
|  1 |      2 |       -2 |        2 |
+----+--------+----------+----------+
|  2 |      4 |       -4 |        4 |
+----+--------+----------+----------+
|  3 |      6 |      nan |        6 |
+----+--------+----------+----------+
|  4 |      8 |      nan |        8 |
+----+--------+----------+----------+
>>> merge(left_df, right_df, on="col1", how="outer")
+----+--------+----------+----------+
|    |   col1 |   col2_y |   col2_x |
+====+========+==========+==========+
|  0 |      0 |        0 |        0 |
+----+--------+----------+----------+
|  1 |      1 |      nan |       -1 |
+----+--------+----------+----------+
|  2 |      2 |        2 |       -2 |
+----+--------+----------+----------+
|  3 |      3 |      nan |       -3 |
+----+--------+----------+----------+
|  4 |      4 |        4 |       -4 |
+----+--------+----------+----------+
|  5 |      6 |        6 |      nan |
+----+--------+----------+----------+
|  6 |      8 |        8 |      nan |
+----+--------+----------+----------+
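Per the Note above, joins on multiple columns are supported for integer columns only; a minimal sketch of that pattern (the frames below are hypothetical, and output is omitted):
>>> l = ak.DataFrame({'a': ak.array([0, 1, 1]), 'b': ak.array([0, 1, 2]), 'v': ak.arange(3)})
>>> r = ak.DataFrame({'a': ak.array([1, 1, 2]), 'b': ak.array([1, 2, 3]), 'w': ak.arange(3)})
>>> merge(l, r, on=['a', 'b'])  # inner join on the integer key pair (a, b)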