DataFrames in Arkouda¶

Like pandas, Arkouda supports DataFrames. Their purpose and intended functionality remain the same in Arkouda, but they are built on arkouda.pdarray objects.

class arkouda.DataFrame(initialdata=None, index=None, columns=None)[source]¶

A DataFrame structure based on arkouda arrays.

Parameters:

initialdata (List or dictionary of lists, tuples, or pdarrays) – Each list/dictionary entry corresponds to one column of the data and should be of a homogeneous type. Different columns may have different types. If using a dictionary, keys should be strings.
index (Index, pdarray, or Strings) – Index for the resulting frame. Defaults to an integer range.
columns (List, tuple, pdarray, or Strings) – Column labels to use if the data does not include them. Elements must be strings. Defaults to a stringified integer range.

Examples

>>> import arkouda as ak

Create an empty DataFrame and add a column of data:

>>> df = ak.DataFrame()
>>> df['a'] = ak.array([1,2,3])
>>> df
   a
0  1
1  2
2  3 (3 rows x 1 columns)

Create a new DataFrame using a dictionary of data:

>>> userName = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> userID = ak.array([111, 222, 111, 333, 222, 111])
>>> item = ak.array([0, 0, 1, 1, 2, 0])
>>> day = ak.array([5, 5, 6, 5, 6, 6])
>>> amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
>>> df = ak.DataFrame({
...     'userName': userName,
...     'userID': userID,
...     'item': item,
...     'day': day,
...     'amount': amount
... })
>>> df
  userName  userID  item  day  amount
0    Alice     111     0    5     0.5
1      Bob     222     0    5     0.6
2    Alice     111     1    6     1.1
3    Carol     333     1    5     1.2
4      Bob     222     2    6     4.3
5    Alice     111     0    6     0.6 (6 rows x 5 columns)

Indexing works slightly differently than with pandas:

>>> df[0]
{'userName': np.str_('Alice'), 'userID': np.int64(111), 'item': np.int64(0), 'day': np.int64(5), 'amount': np.float64(0.5)}

>>> df['userID']
array([111 222 111 333 222 111])

>>> df['userName']
array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])

>>> df[ak.array([1,3,5])]
  userName  userID  item  day  amount
1      Bob     222     0    5     0.6
3    Carol     333     1    5     1.2
5    Alice     111     0    6     0.6 (3 rows x 5 columns)

Select rows with a slice (start:stop:stride):

>>> df[1:5:1]
  userName  userID  item  day  amount
1      Bob     222     0    5     0.6
2    Alice     111     1    6     1.1
3    Carol     333     1    5     1.2
4      Bob     222     2    6     4.3 (4 rows x 5 columns)

>>> df[ak.array([1,2,3])]
  userName  userID  item  day  amount
1      Bob     222     0    5     0.6
2    Alice     111     1    6     1.1
3    Carol     333     1    5     1.2 (3 rows x 5 columns)

>>> df[['userID', 'day']]
   userID  day
0     111    5
1     222    5
2     111    6
3     333    5
4     222    6
5     111    6 (6 rows x 2 columns)
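The key types shown above can be summarized with a small pure-Python model. This is only a sketch of the dispatch, not Arkouda code: the frame is modeled as a dictionary of equal-length lists, and the helper name select is hypothetical.

```python
def select(frame, key):
    """Model the indexing forms shown above on a dict of equal-length lists."""
    if isinstance(key, str):
        # A single column label returns that column.
        return frame[key]
    if isinstance(key, int):
        # An integer returns one row as a dict of scalars.
        return {name: col[key] for name, col in frame.items()}
    if isinstance(key, slice):
        # A slice selects rows by start:stop:stride.
        return {name: col[key] for name, col in frame.items()}
    if all(isinstance(k, str) for k in key):
        # A list of column labels returns a narrower frame.
        return {name: frame[name] for name in key}
    # Otherwise treat the key as a sequence of row positions.
    return {name: [col[i] for i in key] for name, col in frame.items()}


frame = {"userID": [111, 222, 111], "day": [5, 5, 6]}
print(select(frame, 0))            # {'userID': 111, 'day': 5}
print(select(frame, ["day"]))      # {'day': [5, 5, 6]}
print(select(frame, [0, 2]))       # {'userID': [111, 111], 'day': [5, 6]}
```

The main difference from pandas that this highlights is that an integer key selects a row rather than a column.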

Data Types¶

Currently, DataFrames support 4 Arkouda data types for supplying columns.

arkouda.pdarray
arkouda.numpy.Strings
arkouda.Categorical
arkouda.numpy.SegArray

Data within the above objects can be of the types below. Note that not every listed type is compatible with every object type above.

int64: 64-bit signed integer
uint64: 64-bit unsigned integer
float64: IEEE 64-bit floating point number
bool: 8-bit boolean value
str: Python string

Iteration¶

Iterating directly over a DataFrame with for x in df is not recommended. It requires transferring all array data from the Arkouda server to the Python client, and there is almost always a more array-oriented way to express an iterator-based computation. To force this transfer, use the to_pandas function to return the DataFrame as a pandas.DataFrame. The transfer will raise an error if it exceeds the byte limit defined in ak.client.maxTransferBytes.

arkouda.DataFrame.to_pandas(self, datalimit=1073741824, retain_index=False)¶

Send this DataFrame to a pandas DataFrame.

Parameters:

datalimit (int, default=arkouda.client.maxTransferBytes) – The maximum size, in megabytes, to transfer. The requested DataFrame will be converted to a pandas DataFrame only if the estimated size of the DataFrame does not exceed this value.
retain_index (bool, default=False) – Normally, to_pandas() creates a new range index object. If you want to keep the index column, set this to True.

Returns:

The result of converting this DataFrame to a pandas DataFrame.

Return type:

pandas.DataFrame

Examples

>>> import arkouda as ak
>>> ak_df = ak.DataFrame({"A": ak.arange(2), "B": -1 * ak.arange(2)})
>>> type(ak_df)
<class 'arkouda.dataframe.DataFrame'>
>>> ak_df
   A  B
0  0  0
1  1 -1 (2 rows x 2 columns)

>>> import pandas as pd
>>> pd_df = ak_df.to_pandas()
>>> type(pd_df)
<class 'pandas.core.frame.DataFrame'>
>>> pd_df
   A  B
0  0  0
1  1 -1

Features¶

DataFrames support the majority of functionality offered by pandas.DataFrame.

Drop¶

arkouda.DataFrame.drop(self, keys, axis=0, inplace=False)¶

Drop column/s or row/s from the dataframe.

Parameters:

keys (str, int or list) – The labels to be dropped on the given axis.
axis (int or str) – The axis to drop from: 0 or 'index' drops rows; 1 or 'columns' drops columns.
inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DataFrame when inplace=False; None when inplace=True

Return type:

DataFrame or None

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4 (2 rows x 2 columns)

Drop column

>>> df.drop('col1', axis = 1)
   col2
0     3
1     4 (2 rows x 1 columns)

Drop row

>>> df.drop(0, axis = 0)
   col1  col2
1     2     4 (1 rows x 2 columns)

GroupBy¶

arkouda.DataFrame.groupby(self, keys, use_series=True, as_index=True, dropna=True)¶

Group the dataframe by a column or a list of columns.

Alias for GroupBy.

Parameters:

keys (str or list of str) – An (ordered) list of column names or a single string to group by.
use_series (bool, default=True) – If True, returns an arkouda.pandas.dataframe.DataFrameGroupBy object. Otherwise, returns an arkouda.pandas.groupbyclass.GroupBy object.
as_index (bool, default=True) – If True, the groupby columns will be set as the index; otherwise, they will be treated as DataFrame columns.
dropna (bool, default=True) – If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Returns:

If use_series = True, returns an arkouda.pandas.dataframe.DataFrameGroupBy object. Otherwise returns an arkouda.pandas.groupbyclass.GroupBy object.

Return type:

arkouda.pandas.dataframe.DataFrameGroupBy or arkouda.pandas.groupbyclass.GroupBy

See also

arkouda.GroupBy

Examples

>>> import arkouda as ak
>>> import numpy as np
>>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]})
>>> df
   col1  col2
0   1.0     4
1   1.0     5
2   2.0     6
3   NaN     7 (4 rows x 2 columns)

>>> df.GroupBy("col1")
<arkouda.groupbyclass.GroupBy object at 0x795584773f00>
>>> df.GroupBy("col1").size()
(array([1.00000000000000000 2.00000000000000000]), array([2 1]))

>>> df.GroupBy("col1",use_series=True).size()
col1
1.0    2
2.0    1
dtype: int64
>>> df.GroupBy("col1",use_series=True, as_index = False).size()
   col1  size
0   1.0     2
1   2.0     1 (2 rows x 2 columns)
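The effect of dropna on group sizes can be illustrated without a server. The following is a minimal pure-Python sketch, not Arkouda code: the helper group_sizes is hypothetical, and placing the NaN group last when dropna=False is an assumption made for illustration.

```python
import math
from collections import Counter


def group_sizes(keys, dropna=True):
    """Count rows per distinct key; NaN keys are dropped or pooled into one group."""
    counts = Counter()
    nan_count = 0
    for key in keys:
        if isinstance(key, float) and math.isnan(key):
            # NaN never compares equal to itself, so count it separately.
            nan_count += 1
        else:
            counts[key] += 1
    result = sorted(counts.items())
    if not dropna and nan_count:
        # Assumption: the NaN group is listed after the ordinary keys.
        result.append((math.nan, nan_count))
    return result


col1 = [1.0, 1.0, 2.0, math.nan]
print(group_sizes(col1))  # [(1.0, 2), (2.0, 1)]
```

With dropna=True the result matches the sizes shown in the example above: the row whose key is NaN does not contribute to any group.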

Copy¶

arkouda.DataFrame.copy(self, deep=True)¶

Make a copy of this object’s data.

When deep = True (default), a new object will be created with a copy of the calling object’s data. Modifications to the data of the copy will not be reflected in the original object.

When deep = False a new object will be created without copying the calling object’s data. Any changes to the data of the original object will be reflected in the shallow copy, and vice versa.

Parameters:

deep (bool, default=True) – When True, return a deep copy. Otherwise, return a shallow copy.

Returns:

A deep or shallow copy according to caller specification.

Return type:

DataFrame

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4 (2 rows x 2 columns)

>>> df_deep = df.copy(deep=True)
>>> df_deep['col1'] +=1
>>> df
   col1  col2
0     1     3
1     2     4 (2 rows x 2 columns)

>>> df_shallow = df.copy(deep=False)
>>> df_shallow['col1'] +=1
>>> df
   col1  col2
0     2     3
1     3     4 (2 rows x 2 columns)
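The deep/shallow distinction mirrors Python's own copy semantics. The following is a minimal pure-Python sketch, not Arkouda code: it models a frame as a dictionary of lists (an assumption made for illustration) and uses the standard copy module to show which changes propagate.

```python
import copy

# Model a frame as a dict mapping column names to lists (illustrative only).
frame = {"col1": [1, 2], "col2": [3, 4]}

# Deep copy: the column data itself is duplicated.
deep = copy.deepcopy(frame)
deep["col1"][0] += 1

# Shallow copy: a new dict that still refers to the same column lists.
shallow = copy.copy(frame)
shallow["col2"][0] += 10

print(frame)  # {'col1': [1, 2], 'col2': [13, 4]}
print(deep)   # {'col1': [2, 2], 'col2': [3, 4]}
```

The change made through the deep copy stays local to it, while the change made through the shallow copy is visible in the original because both refer to the same underlying column.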

Filter¶

Permutations¶

arkouda.DataFrame.apply_permutation(self, perm)¶

Apply a permutation to an entire DataFrame.

The operation is done in place and the original DataFrame will be modified.

This may be useful if you want to unsort a DataFrame, or even to apply an arbitrary permutation such as the inverse of a sorting permutation.

Parameters:

perm (pdarray) – A permutation array. Should be the same size as the data arrays, and should consist of the integers [0,size-1] in some order. Very minimal testing is done to ensure this is a permutation.

See also

sort

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df
   col1  col2
0     1     4
1     2     5
2     3     6 (3 rows x 2 columns)

>>> perm_arry = ak.array([0, 2, 1])
>>> df.apply_permutation(perm_arry)
>>> df
   col1  col2
0     1     4
2     3     6
1     2     5 (3 rows x 2 columns)
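The idea of undoing a sort with the inverse of its permutation can be sketched in plain Python. This is an illustrative model on lists, not Arkouda code: the three helper functions are hypothetical, and the sketch assumes the gather convention in which position k of the result holds element perm[k] of the input.

```python
def argsort(values):
    """Return the permutation that sorts values in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])


def apply_permutation(values, perm):
    """Reorder values so that position k holds values[perm[k]]."""
    return [values[i] for i in perm]


def invert_permutation(perm):
    """Return inv such that applying perm and then inv restores the input."""
    inv = [0] * len(perm)
    for position, source in enumerate(perm):
        inv[source] = position
    return inv


data = [30, 10, 20]
perm = argsort(data)                          # [1, 2, 0]
sorted_data = apply_permutation(data, perm)   # [10, 20, 30]
restored = apply_permutation(sorted_data, invert_permutation(perm))
print(restored)  # [30, 10, 20]
```

Keeping the sorting permutation around is what makes the later "unsort" possible; the inverse cannot be recovered from the sorted data alone.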

Sorting¶

arkouda.DataFrame.argsort(self, key, ascending=True)¶

Return the permutation that sorts the dataframe by key.

Parameters:

key (str) – The key to sort on.
ascending (bool, default = True) – If true, sort the key in ascending order. Otherwise, sort the key in descending order.

Returns:

The permutation array that sorts the data on key.

Return type:

arkouda.numpy.pdarrayclass.pdarray

See also

coargsort

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]})
>>> df
   col1  col2
0   1.1     6
1   3.1     5
2   2.1     4 (3 rows x 2 columns)

>>> df.argsort('col1')
array([0 2 1])
>>> sorted_df1 = df[df.argsort('col1')]
>>> sorted_df1
   col1  col2
0   1.1     6
2   2.1     4
1   3.1     5 (3 rows x 2 columns)

>>> df.argsort('col2')
array([2 1 0])
>>> sorted_df2 = df[df.argsort('col2')]
>>> sorted_df2
   col1  col2
2   2.1     4
1   3.1     5
0   1.1     6 (3 rows x 2 columns)

arkouda.DataFrame.coargsort(self, keys, ascending=True)¶

Return the permutation that sorts the dataframe by keys.

Note: Sorting using Strings may not yield correct sort order.

Parameters:

keys (list of str) – The keys to sort on.

Returns:

The permutation array that sorts the data on keys.

Return type:

arkouda.numpy.pdarrayclass.pdarray

Example

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]})
>>> df
   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7 (3 rows x 3 columns)

>>> df.coargsort(['col1', 'col2'])
array([2 0 1])
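What a multi-key sorting permutation means can be shown with a short pure-Python model. This is a sketch on lists rather than Arkouda code; the function below is a hypothetical stand-in that orders rows by the first key and breaks ties with the following keys.

```python
def coargsort(columns):
    """Return the permutation that sorts rows by columns[0], then columns[1], and so on."""
    n = len(columns[0])
    # Each row's sort key is the tuple of its values across the key columns.
    return sorted(range(n), key=lambda i: tuple(col[i] for col in columns))


col1 = [2, 2, 1]
col2 = [3, 4, 3]
perm = coargsort([col1, col2])
print(perm)  # [2, 0, 1]
```

Row 2 comes first because it has the smallest col1 value; rows 0 and 1 tie on col1 and are ordered by col2, which reproduces the permutation in the example above.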

arkouda.DataFrame.sort_values(self, by=None, ascending=True)¶

Sort the DataFrame by one or more columns.

If no column is specified, all columns are used.

Note: The sort order of arkouda.numpy.strings.Strings columns may be incorrect when sorting by multiple columns.

Parameters:

by (str or list/tuple of str, default = None) – The name(s) of the column(s) to sort by.
ascending (bool, default = True) – Sort values in ascending (default) or descending order.

See also

apply_permutation

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]})
>>> df
   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7 (3 rows x 3 columns)

>>> df.sort_values()
   col1  col2  col3
2     1     3     7
0     2     3     5
1     2     4     6 (3 rows x 3 columns)

>>> df.sort_values("col3")
   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7 (3 rows x 3 columns)

Tail/Head of Data¶

arkouda.DataFrame.tail(self, n=5)¶

Return the last n rows.

This function returns the last n rows for the dataframe. It is useful for quickly testing if your object has the right type of data in it.

Parameters:

n (int, default=5) – Number of rows to select.

Returns:

The last n rows of the DataFrame.

Return type:

DataFrame

See also

arkouda.pandas.dataframe.head

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> df
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9 (10 rows x 2 columns)

>>> df.tail()
   col1  col2
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9 (5 rows x 2 columns)

>>> df.tail(n=2)
   col1  col2
8     8    -8
9     9    -9 (2 rows x 2 columns)

arkouda.DataFrame.head(self, n=5)¶

Return the first n rows.

This function returns the first n rows of the dataframe. It is useful for quickly verifying data, for example, after sorting or appending rows.

Parameters:

n (int, default = 5) – Number of rows to select.

Returns:

The first n rows of the DataFrame.

Return type:

DataFrame

See also

tail

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> df
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9 (10 rows x 2 columns)

>>> df.head()
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4 (5 rows x 2 columns)

>>> df.head(n=2)
   col1  col2
0     0     0
1     1    -1 (2 rows x 2 columns)

Rename Columns¶

arkouda.DataFrame.rename(self, mapper=None, index=None, column=None, axis=0, inplace=False)¶

Rename indexes or columns according to a mapping.

Parameters:

mapper (callable or dict-like, Optional) – Function or dictionary mapping existing values to new values. Nonexistent names will not raise an error. The value of axis determines whether columns or indexes are renamed.
column (callable or dict-like, Optional) – Function or dictionary mapping existing column names to new column names. Nonexistent names will not raise an error. When this is set, axis is ignored.
index (callable or dict-like, Optional) – Function or dictionary mapping existing index names to new index names. Nonexistent names will not raise an error. When this is set, axis is ignored.
axis (int or str, default=0) – The axis on which to perform the rename: 0 or "index" renames indexes; 1 or "column" renames columns.
inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DataFrame when inplace=False; None when inplace=True.

Return type:

DataFrame or None

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> df
   A  B
0  1  4
1  2  5
2  3  6 (3 rows x 2 columns)

Rename columns using a mapping:

>>> df.rename(column={'A':'a', 'B':'c'})
   a  c
0  1  4
1  2  5
2  3  6 (3 rows x 2 columns)

Rename indexes using a mapping:

>>> df.rename(index={0:99, 2:11})
    A  B
99  1  4
1   2  5
11  3  6 (3 rows x 2 columns)

Rename using an axis style parameter:

>>> df.rename(str.lower, axis='column')
   a  b
0  1  4
1  2  5
2  3  6 (3 rows x 2 columns)
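The way a mapper is applied to labels can be modeled in a few lines of plain Python. This is a sketch, not Arkouda's implementation: rename_labels is a hypothetical helper that accepts either a callable or a dictionary and leaves unmapped labels unchanged, in line with the statement that nonexistent names do not raise an error.

```python
def rename_labels(labels, mapper):
    """Apply a dict or callable mapper to each label; unmapped labels pass through."""
    if callable(mapper):
        return [mapper(label) for label in labels]
    # dict.get with the label as default keeps labels that have no mapping.
    return [mapper.get(label, label) for label in labels]


print(rename_labels(["A", "B"], {"A": "a", "B": "c"}))  # ['a', 'c']
print(rename_labels(["A", "B"], str.lower))             # ['a', 'b']
print(rename_labels([0, 1, 2], {0: 99, 2: 11}))         # [99, 1, 11]
print(rename_labels(["A", "B"], {"Z": "z"}))            # ['A', 'B']
```

The last call shows the permissive behavior: a mapping for a name that is not present is silently ignored.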

Append¶

Concatenate¶

arkouda.DataFrame.concat(items, ordered=True)¶

Essentially an append, but with different formatting.

Reset Indexes¶

arkouda.DataFrame.reset_index(self, size=None, inplace=False)¶

Set the index to an integer range.

Useful if this dataframe is the result of a slice operation from another dataframe, or if you have permuted the rows and no longer need to keep that ordering on the rows.

Parameters:

size (int, optional) – If size is passed, do not attempt to determine size based on existing column sizes. Assume caller handles consistency correctly.
inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DataFrame when inplace=False; None when inplace=True.

Return type:

DataFrame or None

Note

Pandas adds a column ‘index’ to indicate the original index. Arkouda does not currently support this behavior.

Example

>>> import arkouda as ak
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> df
   A  B
0  1  4
1  2  5
2  3  6 (3 rows x 2 columns)

>>> perm_df = df[ak.array([0,2,1])]
>>> perm_df
   A  B
0  1  4
2  3  6
1  2  5 (3 rows x 2 columns)

>>> perm_df.reset_index()
   A  B
0  1  4
1  3  6
2  2  5 (3 rows x 2 columns)

Deduplication¶

arkouda.DataFrame.drop_duplicates(self, subset=None, keep='first')¶

Drop duplicated rows and return the resulting DataFrame.

If a subset of the columns is provided, then only one instance of each duplicated row will be returned (keep determines which row).

Parameters:

subset (Iterable) – Iterable of column names to use to dedupe.
keep ({'first', 'last'}, default='first') – Determines which duplicates (if any) to keep.

Returns:

DataFrame with duplicates removed.

Return type:

DataFrame

Example

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 2, 3], 'col2': [4, 5, 5, 6]})
>>> df
   col1  col2
0     1     4
1     2     5
2     2     5
3     3     6 (4 rows x 2 columns)

>>> df.drop_duplicates()
   col1  col2
0     1     4
1     2     5
3     3     6 (3 rows x 2 columns)
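The role of keep can be illustrated with a pure-Python model that works on row tuples. This is a sketch, not Arkouda code: kept_positions is a hypothetical helper, and returning the surviving positions in ascending order is an assumption made for illustration.

```python
def kept_positions(rows, keep="first"):
    """Return the positions of the rows that survive deduplication."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    chosen = {}
    for position, row in enumerate(rows):
        # 'first' records a row only once; 'last' overwrites with later positions.
        if keep == "last" or row not in chosen:
            chosen[row] = position
    return sorted(chosen.values())


rows = [(1, 4), (2, 5), (2, 5), (3, 6)]
print(kept_positions(rows))               # [0, 1, 3]
print(kept_positions(rows, keep="last"))  # [0, 2, 3]
```

With keep='first' the surviving positions match the index labels 0, 1, 3 shown in the example above; with keep='last' the later of the two identical rows is retained instead.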