DataFrames in Arkouda
Like pandas, Arkouda supports DataFrames. Their purpose and intended functionality are the same as in pandas, but Arkouda DataFrames are built on arkouda.pdarray objects.
- class arkouda.DataFrame(initialdata=None, index=None, columns=None)
A DataFrame structure based on arkouda arrays.
- Parameters:
initialdata (List or dictionary of lists, tuples, or pdarrays) – Each list/dictionary entry corresponds to one column of the data and should be of a homogeneous type. Different columns may have different types. If using a dictionary, keys should be strings.
index (Index, pdarray, or Strings) – Index for the resulting frame. Defaults to an integer range.
columns (List, tuple, pdarray, or Strings) – Column labels to use if the data does not include them. Elements must be strings. Defaults to a stringified integer range.
Examples
Create an empty DataFrame and add a column of data:
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame()
>>> df['a'] = ak.array([1,2,3])
>>> display(df)
   a
0  1
1  2
2  3
Create a new DataFrame using a dictionary of data:
>>> userName = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> userID = ak.array([111, 222, 111, 333, 222, 111])
>>> item = ak.array([0, 0, 1, 1, 2, 0])
>>> day = ak.array([5, 5, 6, 5, 6, 6])
>>> amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
>>> df = ak.DataFrame({'userName': userName, 'userID': userID,
...                    'item': item, 'day': day, 'amount': amount})
>>> display(df)
  userName  userID  item  day  amount
0    Alice     111     0    5     0.5
1      Bob     222     0    5     0.6
2    Alice     111     1    6     1.1
3    Carol     333     1    5     1.2
4      Bob     222     2    6     4.3
5    Alice     111     0    6     0.6
Indexing works slightly differently than with pandas:
>>> df[0]
keys      values
userName  Alice
userID    111
item      0
day       5
amount    0.5
>>> df['userID']
array([111, 222, 111, 333, 222, 111])
>>> df['userName']
array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> df[ak.array([1,3,5])]
  userName  userID  item  day  amount
0      Bob     222     0    5     0.6
1    Carol     333     1    5     1.2
2    Alice     111     0    6     0.6
Slice rows with a start, stop, and stride:
>>> df[1:5:1]
  userName  userID  item  day  amount
0      Bob     222     0    5     0.6
1    Alice     111     1    6     1.1
2    Carol     333     1    5     1.2
3      Bob     222     2    6     4.3
>>> df[ak.array([1,2,3])]
  userName  userID  item  day  amount
0      Bob     222     0    5     0.6
1    Alice     111     1    6     1.1
2    Carol     333     1    5     1.2
>>> df[['userID', 'day']]
   userID  day
0     111    5
1     222    5
2     111    6
3     333    5
4     222    6
5     111    6
Data Types
Currently, DataFrames support 4 Arkouda data types for supplying columns:
arkouda.pdarray
arkouda.Strings
arkouda.Categorical
arkouda.SegArray
Data within the above objects can be of the types below. Note that not all listed types are compatible with every object type above.
int64: 64-bit signed integer
uint64: 64-bit unsigned integer
float64: IEEE 64-bit floating point number
bool: 8-bit boolean value
str: Python string
Iteration
Iterating directly over a DataFrame with for x in df is not recommended. It requires transferring all array data from the Arkouda server to the Python client, and there is almost always a more array-oriented way to express an iterator-based computation. To force this transfer, use the to_pandas function to return the DataFrame as a pandas.DataFrame. This transfer will raise an error if it exceeds the byte limit defined in ak.client.maxTransferBytes.
- arkouda.DataFrame.to_pandas(self, datalimit=1073741824, retain_index=False)
Send this DataFrame to a pandas DataFrame.
- Parameters:
datalimit (int, default=arkouda.client.maxTransferBytes) – The maximum size, in megabytes, to transfer. The requested DataFrame will be converted to a pandas DataFrame only if its estimated size does not exceed this value.
retain_index (bool, default=False) – Normally, to_pandas() creates a new range index object. If you want to keep the index column, set this to True.
- Returns:
The result of converting this DataFrame to a pandas DataFrame.
- Return type:
pandas.DataFrame
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> ak_df = ak.DataFrame({"A": ak.arange(2), "B": -1 * ak.arange(2)})
>>> type(ak_df)
arkouda.dataframe.DataFrame
>>> display(ak_df)
   A   B
0  0   0
1  1  -1
>>> import pandas as pd
>>> pd_df = ak_df.to_pandas()
>>> type(pd_df)
pandas.core.frame.DataFrame
>>> display(pd_df)
   A   B
0  0   0
1  1  -1
Features
DataFrames support the majority of functionality offered by pandas.DataFrame.
Drop
- arkouda.DataFrame.drop(self, keys, axis=0, inplace=False)
Drop column/s or row/s from the dataframe.
- Parameters:
keys (str, int or list) – The labels to be dropped on the given axis.
axis (int or str) – The axis to drop from. 0/'index' - drop rows, 1/'columns' - drop columns.
inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False; None when inplace=True
- Return type:
arkouda.dataframe.DataFrame or None
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> display(df)
   col1  col2
0     1     3
1     2     4
Drop column
>>> df.drop('col1', axis = 1)
   col2
0     3
1     4
Drop row
>>> df.drop(0, axis = 0)
   col1  col2
0     2     4
GroupBy
- arkouda.DataFrame.groupby(self, keys, use_series=True, as_index=True, dropna=True)
Group the dataframe by a column or a list of columns. Alias for GroupBy.
- Parameters:
keys (str or list of str) – An (ordered) list of column names or a single string to group by.
use_series (bool, default=True) – If True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise an arkouda.groupbyclass.GroupBy object.
as_index (bool, default=True) – If True, the groupby columns will be set as the index. Otherwise, the groupby columns will be treated as DataFrame columns.
dropna (bool, default=True) – If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.
- Returns:
If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise returns an arkouda.groupbyclass.GroupBy object.
- Return type:
arkouda.dataframe.DataFrameGroupBy or arkouda.groupbyclass.GroupBy
Examples
>>> import arkouda as ak
>>> import numpy as np
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]})
>>> df
   col1  col2
0     1     4
1     1     5
2     2     6
3   nan     7
>>> df.GroupBy("col1")
<arkouda.groupbyclass.GroupBy at 0x7f2cf23e10c0>
>>> df.GroupBy("col1").size()
(array([1.00000000000000000 2.00000000000000000]), array([2 1]))
>>> df.GroupBy("col1",use_series=True).size()
col1
1.0    2
2.0    1
dtype: int64
>>> df.GroupBy("col1",use_series=True, as_index = False).size()
   col1  size
0     1     2
1     2     1
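The grouping semantics shown above (sorted unique keys, per-group sizes, and the effect of dropna) can be illustrated without a running server. The following is a plain-Python sketch of those semantics only, not Arkouda's implementation; group_sizes is a hypothetical helper name introduced for illustration.

```python
import math
from collections import Counter


def group_sizes(keys, dropna=True):
    """Return (unique_keys, sizes) for a list of group keys.

    Hypothetical helper: mimics the shape of GroupBy(...).size() output
    using plain Python lists instead of server-side arrays.
    """
    counts = Counter()
    nan_count = 0
    for key in keys:
        # NaN never compares equal to itself, so count it separately.
        if isinstance(key, float) and math.isnan(key):
            nan_count += 1
        else:
            counts[key] += 1
    unique = sorted(counts)
    sizes = [counts[k] for k in unique]
    # With dropna=False, rows whose key is NaN form one extra group.
    if not dropna and nan_count:
        unique.append(float("nan"))
        sizes.append(nan_count)
    return unique, sizes


col1 = [1.0, 1.0, 2.0, float("nan")]
print(group_sizes(col1))                # NaN row dropped
print(group_sizes(col1, dropna=False))  # NaN row kept as its own group
```

With dropna=True the NaN row does not contribute to any group, which matches the two groups of sizes 2 and 1 in the example output above.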
Copy
- arkouda.DataFrame.copy(self, deep=True)
Make a copy of this object’s data.
When deep = True (default), a new object will be created with a copy of the calling object’s data. Modifications to the data of the copy will not be reflected in the original object.
When deep = False a new object will be created without copying the calling object’s data. Any changes to the data of the original object will be reflected in the shallow copy, and vice versa.
- Parameters:
deep (bool, default=True) – When True, return a deep copy. Otherwise, return a shallow copy.
- Returns:
A deep or shallow copy according to caller specification.
- Return type:
arkouda.dataframe.DataFrame
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> display(df)
   col1  col2
0     1     3
1     2     4
>>> df_deep = df.copy(deep=True)
>>> df_deep['col1'] +=1
>>> display(df)
   col1  col2
0     1     3
1     2     4
>>> df_shallow = df.copy(deep=False)
>>> df_shallow['col1'] +=1
>>> display(df)
   col1  col2
0     2     3
1     3     4
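The difference between the two copy modes can be sketched with Python's standard copy module. This is only an analogy under the assumption that a frame behaves like a mapping from column names to mutable column data; it says nothing about how Arkouda stores columns, and it covers only in-place mutation of shared data.

```python
import copy

# Hypothetical stand-in for a DataFrame: column name -> list of values.
original = {"col1": [1, 2], "col2": [3, 4]}

# Deep copy: new container and new column data.
deep = copy.deepcopy(original)
deep["col1"][0] += 10
col1_after_deep = list(original["col1"])  # original is untouched

# Shallow copy: new container, but the column data is shared.
shallow = copy.copy(original)
shallow["col1"][0] += 100  # the change is visible through `original`

print(col1_after_deep)
print(original["col1"])
```

The first print shows the original column unchanged after editing the deep copy; the second shows the edit made through the shallow copy.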
Filter
Permutations
- arkouda.DataFrame.apply_permutation(self, perm)
Apply a permutation to an entire DataFrame. The operation is done in place and the original DataFrame will be modified.
This may be useful if you want to unsort a DataFrame, or even to apply an arbitrary permutation such as the inverse of a sorting permutation.
- Parameters:
perm (pdarray) – A permutation array. Should be the same size as the data arrays, and should consist of the integers [0,size-1] in some order. Very minimal testing is done to ensure this is a permutation.
- Return type:
None
See also
sort
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> display(df)
   col1  col2
0     1     4
1     2     5
2     3     6
>>> perm_arry = ak.array([0, 2, 1])
>>> df.apply_permutation(perm_arry)
>>> display(df)
   col1  col2
0     1     4
1     3     6
2     2     5
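To make the "unsort" use case concrete, the sketch below applies a permutation to every column and then undoes it with the inverse permutation. It uses plain Python lists rather than pdarrays, returns a new mapping instead of working in place, and the helper names permute_columns and inverse_permutation are hypothetical.

```python
def permute_columns(columns, perm):
    """Return new columns where row i holds the old row perm[i]."""
    n = len(perm)
    # A valid permutation contains each of 0..n-1 exactly once.
    if sorted(perm) != list(range(n)):
        raise ValueError("perm must contain each of 0..n-1 exactly once")
    return {name: [values[i] for i in perm] for name, values in columns.items()}


def inverse_permutation(perm):
    """Return inv such that applying perm and then inv restores the order."""
    inv = [0] * len(perm)
    for position, source in enumerate(perm):
        inv[source] = position
    return inv


columns = {"col1": [1, 2, 3], "col2": [4, 5, 6]}
perm = [0, 2, 1]
permuted = permute_columns(columns, perm)
restored = permute_columns(permuted, inverse_permutation(perm))
print(permuted)
print(restored)
```

The permuted columns match the example output above, and applying the inverse permutation returns the rows to their original order.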
Sorting
- arkouda.DataFrame.argsort(self, key, ascending=True)
Return the permutation that sorts the dataframe by key.
- Parameters:
key (str) – The key to sort on.
ascending (bool, default = True) – If true, sort the key in ascending order. Otherwise, sort the key in descending order.
- Returns:
The permutation array that sorts the data on key.
- Return type:
arkouda.pdarrayclass.pdarray
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]})
>>> display(df)
   col1  col2
0   1.1     6
1   3.1     5
2   2.1     4
>>> df.argsort('col1')
array([0 2 1])
>>> sorted_df1 = df[df.argsort('col1')]
>>> display(sorted_df1)
   col1  col2
0   1.1     6
1   2.1     4
2   3.1     5
>>> df.argsort('col2')
array([2 1 0])
>>> sorted_df2 = df[df.argsort('col2')]
>>> display(sorted_df2)
   col1  col2
0   2.1     4
1   3.1     5
2   1.1     6
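The relationship between a sorting permutation and the sorted frame can be sketched in plain Python. This illustrates the idea only and is not how Arkouda computes the permutation on the server; argsort_values and take_rows are hypothetical names.

```python
def argsort_values(values, ascending=True):
    """Return the row indices that put `values` in sorted order."""
    return sorted(range(len(values)), key=values.__getitem__, reverse=not ascending)


def take_rows(columns, indices):
    """Gather the given rows from every column, like df[perm]."""
    return {name: [col[i] for i in indices] for name, col in columns.items()}


columns = {"col1": [1.1, 3.1, 2.1], "col2": [6, 5, 4]}
perm = argsort_values(columns["col1"])
sorted_by_col1 = take_rows(columns, perm)
print(perm)
print(sorted_by_col1)
```

Indexing every column with the same permutation keeps rows intact while reordering them, which is why df[df.argsort('col1')] yields a frame sorted on col1.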
- arkouda.DataFrame.coargsort(self, keys, ascending=True)
Return the permutation that sorts the dataframe by keys.
Note: Sorting using Strings may not yield correct sort order.
- Parameters:
keys (list of str) – The keys to sort on.
- Returns:
The permutation array that sorts the data on keys.
- Return type:
arkouda.pdarrayclass.pdarray
Example
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]})
>>> display(df)
   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7
>>> df.coargsort(['col1', 'col2'])
array([2 0 1])
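Sorting on several keys means ordering rows by the first key and breaking ties with each later key in turn. A plain-Python sketch of that lexicographic ordering follows; it assumes ascending order on every key, uses lists in place of pdarrays, and coargsort_columns is a hypothetical helper name rather than part of Arkouda.

```python
def coargsort_columns(columns, keys):
    """Return row indices ordered by keys[0], then keys[1], and so on."""
    if not keys:
        raise ValueError("at least one key is required")
    n = len(columns[keys[0]])
    # Tuples compare element by element, which gives lexicographic order.
    return sorted(range(n), key=lambda i: tuple(columns[k][i] for k in keys))


columns = {"col1": [2, 2, 1], "col2": [3, 4, 3], "col3": [5, 6, 7]}
order = coargsort_columns(columns, ["col1", "col2"])
print(order)
```

Rows 0 and 1 tie on col1, so col2 decides their relative order, while row 2 sorts first because it has the smallest col1 value.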
- arkouda.DataFrame.sort_values(self, by=None, ascending=True)
Sort the DataFrame by one or more columns.
If no column is specified, all columns are used.
Note: The sort order may be incorrect for arkouda.strings.Strings columns when multiple columns are being sorted.
- Parameters:
by (str or list/tuple of str, default = None) – The name(s) of the column(s) to sort by.
ascending (bool, default = True) – Sort values in ascending (default) or descending order.
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]})
>>> display(df)
   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7
>>> df.sort_values()
   col1  col2  col3
0     1     3     7
1     2     3     5
2     2     4     6
>>> df.sort_values("col3")
   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7
Tail/Head of Data
- arkouda.DataFrame.tail(self, n=5)
Return the last n rows.
This function returns the last n rows for the dataframe. It is useful for quickly testing if your object has the right type of data in it.
- Parameters:
n (int, default=5) – Number of rows to select.
- Returns:
The last n rows of the DataFrame.
- Return type:
arkouda.dataframe.DataFrame
See also
arkouda.dataframe.head
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> display(df)
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9
>>> df.tail()
   col1  col2
0     5    -5
1     6    -6
2     7    -7
3     8    -8
4     9    -9
>>> df.tail(n=2)
   col1  col2
0     8    -8
1     9    -9
- arkouda.DataFrame.head(self, n=5)
Return the first n rows.
This function returns the first n rows of the dataframe. It is useful for quickly verifying data, for example, after sorting or appending rows.
- Parameters:
n (int, default = 5) – Number of rows to select.
- Returns:
The first n rows of the DataFrame.
- Return type:
arkouda.dataframe.DataFrame
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> display(df)
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9
>>> df.head()
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
>>> df.head(n=2)
   col1  col2
0     0     0
1     1    -1
Rename Columns
- arkouda.DataFrame.rename(self, mapper=None, index=None, column=None, axis=0, inplace=False)
Rename indexes or columns according to a mapping.
- Parameters:
mapper (callable or dict-like, Optional) – Function or dictionary mapping existing values to new values. Nonexistent names will not raise an error. Uses the value of axis to determine whether to rename columns or the index.
column (callable or dict-like, Optional) – Function or dictionary mapping existing column names to new column names. Nonexistent names will not raise an error. When this is set, axis is ignored.
index (callable or dict-like, Optional) – Function or dictionary mapping existing index names to new index names. Nonexistent names will not raise an error. When this is set, axis is ignored.
axis (int or str, default=0) – Indicates which axis to perform the rename on. 0/"index" - Indexes, 1/"column" - Columns.
inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False; None when inplace=True.
- Return type:
arkouda.dataframe.DataFrame or None
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> display(df)
   A  B
0  1  4
1  2  5
2  3  6
Rename columns using a mapping:
>>> df.rename(column={'A':'a', 'B':'c'})
   a  c
0  1  4
1  2  5
2  3  6
Rename indexes using a mapping:
>>> df.rename(index={0:99, 2:11})
    A  B
99  1  4
1   2  5
11  3  6
Rename using an axis style parameter:
>>> df.rename(str.lower, axis='column')
   a  b
0  1  4
1  2  5
2  3  6
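The mapper rules described in the parameters (a callable or a dict-like, with nonexistent names ignored rather than raising) can be sketched as a small label-mapping function. This is a plain-Python illustration of those rules, not Arkouda's code, and rename_labels is a hypothetical helper name.

```python
def rename_labels(labels, mapper):
    """Apply a callable or dict-like mapper to a list of labels."""
    if callable(mapper):
        return [mapper(label) for label in labels]
    # Dict-like: labels missing from the mapping are kept unchanged,
    # and mapping keys that match no label are silently ignored.
    return [mapper.get(label, label) for label in labels]


print(rename_labels(["A", "B"], {"A": "a", "B": "c", "Z": "z"}))
print(rename_labels(["A", "B"], str.lower))
print(rename_labels([0, 1, 2], {0: 99, 2: 11}))
```

The same function works for column names and for index values, which mirrors how the column and index arguments accept the same kinds of mapper.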
Append
Concatenate
- arkouda.DataFrame.concat(items, ordered=True)
Essentially an append, but different formatting.
Reset Indexes
- arkouda.DataFrame.reset_index(self, size=None, inplace=False)
Set the index to an integer range.
Useful if this dataframe is the result of a slice operation from another dataframe, or if you have permuted the rows and no longer need to keep that ordering on the rows.
- Parameters:
size (int, optional) – If size is passed, do not attempt to determine size based on existing column sizes. Assume caller handles consistency correctly.
inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False; None when inplace=True.
- Return type:
arkouda.dataframe.DataFrame or None
Note
Pandas adds a column ‘index’ to indicate the original index. Arkouda does not currently support this behavior.
Example
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> display(df)
   A  B
0  1  4
1  2  5
2  3  6
>>> perm_df = df[ak.array([0,2,1])]
>>> display(perm_df)
   A  B
0  1  4
1  3  6
2  2  5
>>> perm_df.reset_index()
   A  B
0  1  4
1  3  6
2  2  5
Deduplication
- arkouda.DataFrame.drop_duplicates(self, subset=None, keep='first')
Drops duplicated rows and returns the resulting DataFrame.
If a subset of the columns is provided, then only one instance of each duplicated row will be returned (keep determines which row).
- Parameters:
subset (Iterable) – Iterable of column names to use to dedupe.
keep ({'first', 'last'}, default='first') – Determines which duplicates (if any) to keep.
- Returns:
DataFrame with duplicates removed.
- Return type:
arkouda.dataframe.DataFrame
Example
>>> df = ak.DataFrame({'col1': [1, 2, 2, 3], 'col2': [4, 5, 5, 6]})
>>> display(df)
   col1  col2
0     1     4
1     2     5
2     2     5
3     3     6
>>> df.drop_duplicates()
   col1  col2
0     1     4
1     2     5
2     3     6
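The roles of subset and keep can be illustrated with a plain-Python sketch over column lists. It is an assumption-laden illustration rather than Arkouda's algorithm: the helper name drop_duplicate_rows is hypothetical, and the surviving rows are returned in their original relative order, which Arkouda may or may not guarantee.

```python
def drop_duplicate_rows(columns, subset=None, keep="first"):
    """Return columns with duplicate rows removed.

    Rows are duplicates when they agree on every column in `subset`
    (all columns when subset is None). `keep` selects which of the
    duplicates survives: the 'first' or the 'last' occurrence.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    names = list(columns)
    key_names = list(subset) if subset is not None else names
    n = len(columns[names[0]]) if names else 0
    # Scan forward to keep first occurrences, backward to keep last ones.
    order = range(n) if keep == "first" else range(n - 1, -1, -1)
    seen = set()
    kept = []
    for i in order:
        key = tuple(columns[k][i] for k in key_names)
        if key not in seen:
            seen.add(key)
            kept.append(i)
    kept.sort()  # restore original relative order
    return {name: [columns[name][i] for i in kept] for name in names}


columns = {"col1": [1, 2, 2, 3], "col2": [4, 5, 5, 6]}
print(drop_duplicate_rows(columns))
```

Passing a subset makes keep matter: with subset=['k'], rows that share a k value but differ elsewhere collapse to whichever occurrence keep selects.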