arkouda.dataframe

Module Contents

Classes

DataFrame

A DataFrame structure based on arkouda arrays.

DiffAggregate

A column in a GroupBy that has been differenced.

Functions

intersect(a, b[, positions, unique])

Find the intersection of two arkouda arrays.

intx(a, b)

Find all the rows that are in both dataframes.

invert_permutation(perm)

Find the inverse of a permutation array.

merge(→ DataFrame)

Merge Arkouda DataFrames with a database-style join.

class arkouda.dataframe.DataFrame(initialdata=None, index=None, columns=None)[source]

Bases: collections.UserDict

A DataFrame structure based on arkouda arrays.

Parameters:
  • initialdata (List or dictionary of lists, tuples, or pdarrays) – Each list/dictionary entry corresponds to one column of the data and should be a homogeneous type. Different columns may have different types. If using a dictionary, keys should be strings.

  • index (Index, pdarray, or Strings) – Index for the resulting frame. Defaults to an integer range.

  • columns (List, tuple, pdarray, or Strings) – Column labels to use if the data does not include them. Elements must be strings. Defaults to a stringified integer range.

Examples

Create an empty DataFrame and add a column of data:

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame()
>>> df['a'] = ak.array([1,2,3])
>>> display(df)

   a
0  1
1  2
2  3

Create a new DataFrame using a dictionary of data:

>>> userName = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> userID = ak.array([111, 222, 111, 333, 222, 111])
>>> item = ak.array([0, 0, 1, 1, 2, 0])
>>> day = ak.array([5, 5, 6, 5, 6, 6])
>>> amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
>>> df = ak.DataFrame({'userName': userName, 'userID': userID,
>>>            'item': item, 'day': day, 'amount': amount})
>>> display(df)

  userName  userID  item  day  amount
0    Alice     111     0    5     0.5
1      Bob     222     0    5     0.6
2    Alice     111     1    6     1.1
3    Carol     333     1    5     1.2
4      Bob     222     2    6     4.3
5    Alice     111     0    6     0.6

Indexing works slightly differently than with pandas:

>>> df[0]

keys      values
userName  Alice
userID    111
item      0
day       5
amount    0.5

>>> df['userID']
array([111, 222, 111, 333, 222, 111])
>>> df['userName']
array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> df[ak.array([1,3,5])]

  userName  userID  item  day  amount
0      Bob     222     0    5     0.6
1    Carol     333     1    5     1.2
2    Alice     111     0    6     0.6

Compute the stride:

>>> df[1:5:1]

  userName  userID  item  day  amount
0      Bob     222     0    5     0.6
1    Alice     111     1    6     1.1
2    Carol     333     1    5     1.2
3      Bob     222     2    6     4.3

>>> df[ak.array([1,2,3])]

  userName  userID  item  day  amount
0      Bob     222     0    5     0.6
1    Alice     111     1    6     1.1
2    Carol     333     1    5     1.2

>>> df[['userID', 'day']]

   userID  day
0     111    5
1     222    5
2     111    6
3     333    5
4     222    6
5     111    6

property columns

An Index where the values are the column names of the dataframe.

Returns:

The values of the index are the column names of the dataframe.

Return type:

arkouda.index.Index

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df

   col1  col2
0     1     3
1     2     4

>>> df.columns
Index(array(['col1', 'col2']), dtype='<U0')
property dtypes

The dtypes of the dataframe.

Returns:

dtypes – The dtypes of the dataframe.

Return type:

arkouda.row.Row

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]})
>>> df

   col1 col2
0     1    a
1     2    b

>>> df.dtypes

keys  values
col1  int64
col2  str

property empty

Whether the dataframe is empty.

Returns:

True if the dataframe is empty, otherwise False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({})
>>> df
 0 rows x 0 columns
>>> df.empty
True
property index

The index of the dataframe.

Returns:

The index of the dataframe.

Return type:

arkouda.index.Index or arkouda.index.MultiIndex

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df

   col1  col2
0     1     3
1     2     4

>>> df.index
Index(array([0 1]), dtype='int64')
property info

Returns a summary string of this dataframe.

Returns:

A summary string of this dataframe.

Return type:

str

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]})
>>> df

   col1 col2
0     1    a
1     2    b

>>> df.info
"DataFrame(['col1', 'col2'], 2 rows, 20 B)"
property shape

The shape of the dataframe.

Returns:

Tuple of array dimensions.

Return type:

tuple of int

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df

   col1  col2
0     1     4
1     2     5
2     3     6

>>> df.shape
(3, 2)
property size

Returns the number of bytes on the arkouda server.

Returns:

The number of bytes on the arkouda server.

Return type:

int

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df

   col1  col2
0     1     4
1     2     5
2     3     6

>>> df.size
6
objType = 'DataFrame'
GroupBy(keys, use_series=False, as_index=True, dropna=True)[source]

Group the dataframe by a column or a list of columns.

Parameters:
  • keys (str or list of str) – An (ordered) list of column names or a single string to group by.

  • use_series (bool, default=False) – If True, returns an arkouda.dataframe.GroupBy object. Otherwise an arkouda.groupbyclass.GroupBy object.

  • as_index (bool, default=True) – If True, the groupby columns will be set as the index; otherwise, the groupby columns will be treated as DataFrame columns.

  • dropna (bool, default=True) – If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Returns:

If use_series = True, returns an arkouda.dataframe.GroupBy object. Otherwise returns an arkouda.groupbyclass.GroupBy object.

Return type:

arkouda.dataframe.GroupBy or arkouda.groupbyclass.GroupBy

See also

arkouda.GroupBy

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]})
>>> df

   col1  col2
0     1     4
1     1     5
2     2     6
3   nan     7

>>> df.GroupBy("col1")
<arkouda.groupbyclass.GroupBy at 0x7f2cf23e10c0>
>>> df.GroupBy("col1").size()
(array([1.00000000000000000 2.00000000000000000]), array([2 1]))
>>> df.GroupBy("col1",use_series=True).size()
col1
1.0    2
2.0    1
dtype: int64
>>> df.GroupBy("col1",use_series=True, as_index = False).size()

   col1  size
0     1     2
1     2     1
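
The effect of dropna on group sizes can be reproduced without a server. The following is a minimal pure-Python sketch, not arkouda's implementation; the helper name group_sizes is hypothetical, and placing the NaN group last when dropna=False is an assumption made for illustration.

```python
import math
from collections import Counter


def group_sizes(keys, dropna=True):
    """Return sorted (key, count) pairs; NaN keys are skipped when dropna is True."""
    counts = Counter()
    nan_count = 0
    for k in keys:
        if isinstance(k, float) and math.isnan(k):
            nan_count += 1  # NaN != NaN, so it is tallied separately
        else:
            counts[k] += 1
    result = sorted(counts.items())
    if not dropna and nan_count:
        # Assumption: the NaN group is reported after the regular groups.
        result.append((math.nan, nan_count))
    return result


col1 = [1.0, 1.0, 2.0, math.nan]
sizes = group_sizes(col1)                        # NaN row dropped
sizes_with_nan = group_sizes(col1, dropna=False)  # NaN row kept as its own group
```

With the col1 values from the example above, sizes holds two groups, matching the counts shown by size().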

all(axis=0) arkouda.series.Series | bool[source]

Return whether all elements are True, potentially over an axis.

Returns True unless there is at least one element along a DataFrame axis that is False.

Currently, will ignore any columns that are not type bool. This is equivalent to the pandas option bool_only=True.

Parameters:

axis ({0 or ‘index’, 1 or ‘columns’, None}, default = 0) –

Indicate which axis or axes should be reduced.

0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

None : reduce all axes, return a scalar.

Return type:

arkouda.series.Series or bool

Raises:

ValueError – Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A":[True,True,True,False],"B":[True,True,True,False],
...          "C":[True,False,True,False],"D":[True,True,True,True]})

       A      B      C     D
0   True   True   True  True
1   True   True  False  True
2   True   True   True  True
3  False  False  False  True

>>> df.all(axis=0)
A    False
B    False
C    False
D     True
dtype: bool
>>> df.all(axis=1)
0     True
1    False
2     True
3    False
dtype: bool
>>> df.all(axis=None)
False
any(axis=0) arkouda.series.Series | bool[source]

Return whether any element is True, potentially over an axis.

Returns False unless there is at least one element along a Dataframe axis that is True.

Currently, will ignore any columns that are not type bool. This is equivalent to the pandas option bool_only=True.

Parameters:

axis ({0 or ‘index’, 1 or ‘columns’, None}, default = 0) –

Indicate which axis or axes should be reduced.

0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

None : reduce all axes, return a scalar.

Return type:

arkouda.series.Series or bool

Raises:

ValueError – Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A":[True,True,True,False],"B":[True,True,True,False],
...          "C":[True,False,True,False],"D":[False,False,False,False]})

       A      B      C      D
0   True   True   True  False
1   True   True  False  False
2   True   True   True  False
3  False  False  False  False

>>> df.any(axis=0)
A     True
B     True
C     True
D    False
dtype: bool
>>> df.any(axis=1)
0     True
1     True
2     True
3    False
dtype: bool
>>> df.any(axis=None)
True
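
The axis semantics shared by all and any, including the way non-bool columns are ignored, can be sketched in plain Python. This is an illustration only, not arkouda's implementation; the helper reduce_bool and the dict-of-lists frame are hypothetical stand-ins.

```python
from typing import Callable, Dict, Optional, Union

Frame = Dict[str, list]  # hypothetical stand-in: column name -> values


def reduce_bool(frame: Frame, op: Callable, axis: Optional[Union[int, str]] = 0):
    """Reduce bool columns with `op` (all or any); non-bool columns are ignored."""
    cols = {k: v for k, v in frame.items() if all(isinstance(x, bool) for x in v)}
    if axis in (0, "index"):
        # One result per column, keyed by column label.
        return {k: op(v) for k, v in cols.items()}
    if axis in (1, "columns"):
        # One result per row, in row order.
        return [op(row) for row in zip(*cols.values())]
    if axis is None:
        # A single scalar over every bool element.
        return op(x for v in cols.values() for x in v)
    raise ValueError("axis must be 0, 'index', 1, 'columns', or None")


frame = {
    "A": [True, True, True, False],
    "C": [True, False, True, False],
    "n": [1, 2, 3, 4],  # not bool, so it is skipped
}
col_all = reduce_bool(frame, all, axis=0)
row_any = reduce_bool(frame, any, axis=1)
scalar_all = reduce_bool(frame, all, axis=None)
```

Passing the built-in all or any as op keeps the two reductions symmetric, which mirrors how the two methods differ only in the reduction applied.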
append(other, ordered=True)[source]

Concatenate data from ‘other’ onto the end of this DataFrame, in place.

Explicitly, use the arkouda concatenate function to append the data from each column in other to the end of self. This operation is done in place, in the sense that the underlying pdarrays are updated from the result of the arkouda concatenate function, rather than returning a new DataFrame object containing the result.

Parameters:
  • other (DataFrame) – The DataFrame object whose data will be appended to this DataFrame.

  • ordered (bool, default=True) – If False, allow rows to be interleaved for better performance (but data within a row remains together). By default, append all rows to the end, in input order.

Returns:

Appending occurs in-place, but result is returned for compatibility.

Return type:

self

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df1 = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

   col1  col2
0     1     3
1     2     4

>>> df2 = ak.DataFrame({'col1': [3], 'col2': [5]})

   col1  col2
0     3     5

>>> df1.append(df2)
>>> df1

   col1  col2
0     1     3
1     2     4
2     3     5

apply_permutation(perm)[source]

Apply a permutation to an entire DataFrame. The operation is done in place and the original DataFrame will be modified.

This may be useful if you want to unsort a DataFrame, or even to apply an arbitrary permutation such as the inverse of a sorting permutation.

Parameters:

perm (pdarray) – A permutation array. Should be the same size as the data arrays, and should consist of the integers [0,size-1] in some order. Very minimal testing is done to ensure this is a permutation.

Return type:

None

See also

sort

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

   col1  col2
0     1     4
1     2     5
2     3     6

>>> perm_arry = ak.array([0, 2, 1])
>>> df.apply_permutation(perm_arry)
>>> display(df)

   col1  col2
0     1     4
1     3     6
2     2     5
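
To make the "unsort" use case concrete, the sketch below applies a permutation to every column and then undoes it with the inverse permutation. It is a pure-Python illustration under the assumption that row i of the result is row perm[i] of the input, which matches the example above; the function names are hypothetical local helpers, not the arkouda functions.

```python
def apply_permutation(columns, perm):
    """Reorder every column in place so that new[i] = old[perm[i]]."""
    for name, values in columns.items():
        columns[name] = [values[i] for i in perm]


def invert_permutation(perm):
    """Return inv such that inv[perm[i]] == i."""
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv


columns = {"col1": [1, 2, 3], "col2": [4, 5, 6]}
perm = [2, 0, 1]

apply_permutation(columns, perm)
permuted = {k: list(v) for k, v in columns.items()}  # snapshot for inspection

# Applying the inverse restores the original row order.
apply_permutation(columns, invert_permutation(perm))
```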

argsort(key, ascending=True)[source]

Return the permutation that sorts the dataframe by key.

Parameters:
  • key (str) – The key to sort on.

  • ascending (bool, default = True) – If true, sort the key in ascending order. Otherwise, sort the key in descending order.

Returns:

The permutation array that sorts the data on key.

Return type:

arkouda.pdarrayclass.pdarray

See also

coargsort

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]})
>>> display(df)

   col1  col2
0   1.1     6
1   3.1     5
2   2.1     4

>>> df.argsort('col1')
array([0 2 1])
>>> sorted_df1 = df[df.argsort('col1')]
>>> display(sorted_df1)

   col1  col2
0   1.1     6
1   2.1     4
2   3.1     5

>>> df.argsort('col2')
array([2 1 0])
>>> sorted_df2 = df[df.argsort('col2')]
>>> display(sorted_df2)

   col1  col2
0   2.1     4
1   3.1     5
2   1.1     6
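
The relationship between a key column and the permutation returned by argsort can be checked in plain Python. This is a sketch of the idea only; the local argsort helper is hypothetical and is not the arkouda method.

```python
def argsort(values, ascending=True):
    """Return the index permutation that sorts `values`."""
    return sorted(range(len(values)), key=values.__getitem__, reverse=not ascending)


col1 = [1.1, 3.1, 2.1]
col2 = [6, 5, 4]

perm = argsort(col1)
# Indexing another column with the permutation keeps rows aligned.
col2_by_col1 = [col2[i] for i in perm]
desc = argsort(col1, ascending=False)
```

Here perm reproduces the array([0 2 1]) shown for df.argsort('col1') above.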

static attach(user_defined_name: str) DataFrame[source]

Function to return a DataFrame object attached to the registered name in the arkouda server which was registered using register().

Parameters:

user_defined_name (str) – The user-defined name under which the DataFrame object was registered.

Returns:

The DataFrame object created by re-attaching to the corresponding server components.

Return type:

arkouda.dataframe.DataFrame

Raises:

RegistrationError – if user_defined_name is not registered

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
coargsort(keys, ascending=True)[source]

Return the permutation that sorts the dataframe by keys.

Note: Sorting using Strings may not yield correct sort order.

Parameters:

keys (list of str) – The keys to sort on.

Returns:

The permutation array that sorts the data on keys.

Return type:

arkouda.pdarrayclass.pdarray

Example

>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]})
>>> display(df)

   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7

>>> df.coargsort(['col1', 'col2'])
array([2 0 1])
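
A multi-key sort orders rows by the first key and breaks ties with the later keys. The sketch below shows that lexicographic behavior on integer columns in plain Python; it is an illustration, not arkouda's algorithm, and the local coargsort helper is hypothetical.

```python
def coargsort(columns, keys):
    """Permutation sorting rows by keys[0], with ties broken by keys[1], and so on."""
    n = len(columns[keys[0]])
    return sorted(range(n), key=lambda i: tuple(columns[k][i] for k in keys))


columns = {"col1": [2, 2, 1], "col2": [3, 4, 3], "col3": [5, 6, 7]}
perm = coargsort(columns, ["col1", "col2"])

# Reordering any column with the permutation yields the sorted frame.
sorted_col3 = [columns["col3"][i] for i in perm]
```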
classmethod concat(items, ordered=True)[source]

Concatenate the DataFrames in items. Essentially an append, but with different formatting: the DataFrames to combine are passed together as items.

copy(deep=True)[source]

Make a copy of this object’s data.

When deep = True (default), a new object will be created with a copy of the calling object’s data. Modifications to the data of the copy will not be reflected in the original object.

When deep = False a new object will be created without copying the calling object’s data. Any changes to the data of the original object will be reflected in the shallow copy, and vice versa.

Parameters:

deep (bool, default=True) – When True, return a deep copy. Otherwise, return a shallow copy.

Returns:

A deep or shallow copy according to caller specification.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> display(df)

   col1  col2
0     1     3
1     2     4

>>> df_deep = df.copy(deep=True)
>>> df_deep['col1'] +=1
>>> display(df)

   col1  col2
0     1     3
1     2     4

>>> df_shallow = df.copy(deep=False)
>>> df_shallow['col1'] +=1
>>> display(df)

   col1  col2
0     2     3
1     3     4
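
The deep versus shallow distinction has a close analogue in plain Python containers. The sketch below uses a dict of lists as a hypothetical stand-in for a DataFrame; it illustrates the sharing behavior described above and is not how arkouda stores its columns.

```python
import copy

data = {"col1": [1, 2], "col2": [3, 4]}

# Deep copy: the column data is duplicated, so edits stay local to the copy.
deep = copy.deepcopy(data)
deep["col1"][0] += 1
after_deep_edit = list(data["col1"])

# Shallow copy: a new container that still refers to the same column objects,
# so an in-place edit is visible through both.
shallow = dict(data)
shallow["col1"][0] += 10
after_shallow_edit = list(data["col1"])
```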

corr() DataFrame[source]

Return new DataFrame with pairwise correlation of columns.

Returns:

Arkouda DataFrame containing correlation matrix of all columns.

Return type:

arkouda.dataframe.DataFrame

Raises:

RuntimeError – Raised if there’s a server-side error thrown.

See also

pdarray.corr

Notes

Generates the correlation matrix using Pearson R for all columns.

Attempts to convert to numeric values where possible for inclusion in the matrix.

Example

>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [-1, -2]})
>>> display(df)

   col1  col2
0     1    -1
1     2    -2

>>> corr = df.corr()
>>> display(corr)

      col1  col2
col1     1    -1
col2    -1     1

count(axis: int | str = 0, numeric_only=False) arkouda.series.Series[source]

Count non-NA cells for each column or row.

Values of np.nan are considered NA.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.

  • numeric_only (bool = False) – Include only float, int or boolean data.

Returns:

For each column/row the number of non-NA/null entries.

Return type:

arkouda.series.Series

Raises:

ValueError – Raised if axis is not 0, 1, ‘index’, or ‘columns’.

See also

GroupBy.count

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({'col_A': ak.array([7, np.nan]), 'col_B':ak.array([1, 9])})
>>> display(df)

   col_A  col_B
0      7      1
1    nan      9

>>> df.count()
col_A    1
col_B    2
dtype: int64
>>> df = ak.DataFrame({'col_A': ak.array(["a","b","c"]), 'col_B':ak.array([1, np.nan, np.nan])})
>>> display(df)

  col_A  col_B
0     a      1
1     b    nan
2     c    nan

>>> df.count()
col_A    3
col_B    1
dtype: int64
>>> df.count(numeric_only=True)
col_B    1
dtype: int64
>>> df.count(axis=1)
0    2
1    1
2    1
dtype: int64
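
The counting rules (NaN is NA, strings are never NA, numeric_only filters columns) can be sketched in plain Python. This is an illustration under those stated rules, not arkouda's implementation; the helpers is_na and count are hypothetical.

```python
import math


def is_na(x):
    """Only float NaN counts as missing in this sketch."""
    return isinstance(x, float) and math.isnan(x)


def count(columns, axis=0, numeric_only=False):
    """Count non-NA cells per column (axis=0) or per row (axis=1)."""
    if numeric_only:
        columns = {
            k: v for k, v in columns.items()
            if all(isinstance(x, (int, float, bool)) for x in v)
        }
    if axis in (0, "index"):
        return {k: sum(not is_na(x) for x in v) for k, v in columns.items()}
    if axis in (1, "columns"):
        return [sum(not is_na(x) for x in row) for row in zip(*columns.values())]
    raise ValueError("axis must be 0, 1, 'index', or 'columns'")


columns = {"col_A": ["a", "b", "c"], "col_B": [1.0, math.nan, math.nan]}
per_column = count(columns)
numeric = count(columns, numeric_only=True)
per_row = count(columns, axis=1)
```

The three results correspond to the three count calls in the second example above.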
drop(keys: str | int | List[str | int], axis: str | int = 0, inplace: bool = False) None | DataFrame[source]

Drop column/s or row/s from the dataframe.

Parameters:
  • keys (str, int or list) – The labels to be dropped on the given axis.

  • axis (int or str) – The axis on which to drop from. 0/’index’ - drop rows, 1/’columns’ - drop columns.

  • inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DataFrame when inplace=False; None when inplace=True

Return type:

arkouda.dataframe.DataFrame or None

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> display(df)

   col1  col2
0     1     3
1     2     4

Drop column

>>> df.drop('col1', axis = 1)

   col2
0     3
1     4

Drop row

>>> df.drop(0, axis = 0)

   col1  col2
0     2     4

drop_duplicates(subset=None, keep='first')[source]

Drops duplicated rows and returns the resulting DataFrame.

If a subset of the columns is provided, then only one instance of each duplicated row will be returned (keep determines which row).

Parameters:
  • subset (Iterable) – Iterable of column names to use to dedupe.

  • keep ({'first', 'last'}, default='first') – Determines which duplicates (if any) to keep.

Returns:

DataFrame with duplicates removed.

Return type:

arkouda.dataframe.DataFrame

Example

>>> df = ak.DataFrame({'col1': [1, 2, 2, 3], 'col2': [4, 5, 5, 6]})
>>> display(df)

   col1  col2
0     1     4
1     2     5
2     2     5
3     3     6

>>> df.drop_duplicates()

   col1  col2
0     1     4
1     2     5
2     3     6
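
How subset and keep interact is easiest to see on rows that agree on the subset columns but differ elsewhere. The following is a plain-Python sketch over a list of row dicts; the helper is hypothetical, and preserving the original row order in the result is an assumption made for illustration.

```python
def drop_duplicates(rows, subset=None, keep="first"):
    """Drop duplicate rows from a list of row dicts, comparing only `subset` columns."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    if not rows:
        return []
    cols = list(subset) if subset is not None else list(rows[0])
    # Walk from the end when keeping the last occurrence of each key.
    ordered = rows if keep == "first" else rows[::-1]
    seen, kept = set(), []
    for row in ordered:
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept if keep == "first" else kept[::-1]


rows = [
    {"col1": 1, "col2": 4},
    {"col1": 2, "col2": 5},
    {"col1": 2, "col2": 5},
    {"col1": 3, "col2": 6},
]
deduped = drop_duplicates(rows)

# With a subset, rows that differ in other columns still count as duplicates.
tagged = [{"k": 1, "v": "a"}, {"k": 1, "v": "b"}, {"k": 2, "v": "c"}]
first = drop_duplicates(tagged, subset=["k"], keep="first")
last = drop_duplicates(tagged, subset=["k"], keep="last")
```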

dropna(axis: int | str = 0, how: str | None = None, thresh: int | None = None, ignore_index: bool = False) DataFrame[source]

Remove missing values.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default = 0) –

    Determine if rows or columns which contain missing values are removed.

    0, or ‘index’: Drop rows which contain missing values.

    1, or ‘columns’: Drop columns which contain missing values.

    Only a single axis is allowed.

  • how ({'any', 'all'}, default='any') –

    Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

    ’any’: If any NA values are present, drop that row or column.

    ’all’: If all values are NA, drop that row or column.

  • thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.

  • ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns:

DataFrame with NA entries dropped from it.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame(
...     {
...         "A": [True, True, True, True],
...         "B": [1, np.nan, 2, np.nan],
...         "C": [1, 2, 3, np.nan],
...         "D": [False, False, False, False],
...         "E": [1, 2, 3, 4],
...         "F": ["a", "b", "c", "d"],
...         "G": [1, 2, 3, 4],
...     }
... )
>>> display(df)

      A    B    C      D  E  F  G
0  True    1    1  False  1  a  1
1  True  nan    2  False  2  b  2
2  True    2    3  False  3  c  3
3  True  nan  nan  False  4  d  4

>>> df.dropna()

      A  B  C      D  E  F  G
0  True  1  1  False  1  a  1
1  True  2  3  False  3  c  3

>>> df.dropna(axis=1)

      A      D  E  F  G
0  True  False  1  a  1
1  True  False  2  b  2
2  True  False  3  c  3
3  True  False  4  d  4

>>> df.dropna(axis=1, thresh=3)

      A    C      D  E  F  G
0  True    1  False  1  a  1
1  True    2  False  2  b  2
2  True    3  False  3  c  3
3  True  nan  False  4  d  4

>>> df.dropna(axis=1, how="all")

      A    B    C      D  E  F  G
0  True    1    1  False  1  a  1
1  True  nan    2  False  2  b  2
2  True    2    3  False  3  c  3
3  True  nan  nan  False  4  d  4
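
The difference between how="any", how="all", and thresh comes down to a single count of non-NA values per column (or row). The sketch below shows the column-wise case (axis=1) in plain Python; it is an illustration of the rules listed above, not arkouda's implementation, and the helper names are hypothetical.

```python
import math


def is_na(x):
    """Only float NaN counts as missing in this sketch."""
    return isinstance(x, float) and math.isnan(x)


def dropna_columns(columns, how="any", thresh=None):
    """Column-wise dropna on a dict of lists (the axis=1 case)."""
    kept = {}
    for name, values in columns.items():
        n_valid = sum(not is_na(x) for x in values)
        if thresh is not None:
            keep = n_valid >= thresh          # enough non-NA values
        elif how == "any":
            keep = n_valid == len(values)     # no NA at all
        elif how == "all":
            keep = n_valid > 0                # at least one non-NA value
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept[name] = values
    return kept


nan = math.nan
columns = {
    "A": [True, True, True, True],
    "B": [1.0, nan, 2.0, nan],
    "C": [1.0, 2.0, 3.0, nan],
}
any_cols = list(dropna_columns(columns))
thresh_cols = list(dropna_columns(columns, thresh=3))
all_cols = list(dropna_columns(columns, how="all"))
```

Column B has two non-NA values and column C has three, which is why thresh=3 keeps C but drops B in the example above.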

filter_by_range(keys, low=1, high=None)[source]

Find all rows where the value count of the items in a given set of columns (keys) is within the range [low, high].

To filter by a specific value, set low == high.

Parameters:
  • keys (str or list of str) – The names of the columns to group by.

  • low (int, default=1) – The lowest value count.

  • high (int, default=None) – The highest value count; defaults to unlimited.

Returns:

An array of boolean values for qualified rows in this DataFrame.

Return type:

arkouda.pdarrayclass.pdarray

Example

>>> df = ak.DataFrame({'col1': [1, 2, 2, 2, 3, 3], 'col2': [4, 5, 6, 7, 8, 9]})
>>> display(df)

   col1  col2
0     1     4
1     2     5
2     2     6
3     2     7
4     3     8
5     3     9

>>> df.filter_by_range("col1", low=1, high=2)
array([True False False False True True])
>>> filtered_df = df[df.filter_by_range("col1", low=1, high=2)]
>>> display(filtered_df)

   col1  col2
0     1     4
1     3     8
2     3     9
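
Note that the range applies to how often each value occurs, not to the values themselves. The following plain-Python sketch for a single key column makes that explicit; it is an illustration only, and the local helper is hypothetical rather than the arkouda method.

```python
from collections import Counter


def filter_by_range(values, low=1, high=None):
    """Boolean mask: True where the value's occurrence count lies in [low, high]."""
    counts = Counter(values)
    return [
        low <= counts[v] and (high is None or counts[v] <= high)
        for v in values
    ]


col1 = [1, 2, 2, 2, 3, 3]  # value counts: 1 -> 1, 2 -> 3, 3 -> 2

mask = filter_by_range(col1, low=1, high=2)
exactly_three = filter_by_range(col1, low=3, high=3)  # low == high selects one count
```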

classmethod from_pandas(pd_df)[source]

Copy the data from a pandas DataFrame into a new arkouda.dataframe.DataFrame.

Parameters:

pd_df (pandas.DataFrame) – A pandas DataFrame to convert.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import pandas as pd
>>> pd_df = pd.DataFrame({"A":[1,2],"B":[3,4]})
>>> type(pd_df)
pandas.core.frame.DataFrame
>>> display(pd_df)

   A  B
0  1  3
1  2  4

>>> ak_df = ak.DataFrame.from_pandas(pd_df)
>>> type(ak_df)
arkouda.dataframe.DataFrame
>>> display(ak_df)

   A  B
0  1  3
1  2  4

classmethod from_return_msg(rep_msg)[source]

Creates a DataFrame object from an arkouda server response message.

Parameters:

rep_msg (string) – Server response message used to create a DataFrame.

Return type:

arkouda.dataframe.DataFrame

groupby(keys, use_series=True, as_index=True, dropna=True)[source]

Group the dataframe by a column or a list of columns. Alias for GroupBy.

Parameters:
  • keys (str or list of str) – An (ordered) list of column names or a single string to group by.

  • use_series (bool, default=True) – If True, returns an arkouda.dataframe.GroupBy object. Otherwise an arkouda.groupbyclass.GroupBy object.

  • as_index (bool, default=True) – If True, the groupby columns will be set as the index; otherwise, the groupby columns will be treated as DataFrame columns.

  • dropna (bool, default=True) – If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Returns:

If use_series = True, returns an arkouda.dataframe.GroupBy object. Otherwise returns an arkouda.groupbyclass.GroupBy object.

Return type:

arkouda.dataframe.GroupBy or arkouda.groupbyclass.GroupBy

See also

arkouda.GroupBy

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]})
>>> df

   col1  col2
0     1     4
1     1     5
2     2     6
3   nan     7

>>> df.GroupBy("col1")
<arkouda.groupbyclass.GroupBy at 0x7f2cf23e10c0>
>>> df.GroupBy("col1").size()
(array([1.00000000000000000 2.00000000000000000]), array([2 1]))
>>> df.GroupBy("col1",use_series=True).size()
col1
1.0    2
2.0    1
dtype: int64
>>> df.GroupBy("col1",use_series=True, as_index = False).size()

   col1  size
0     1     2
1     2     1

head(n=5)[source]

Return the first n rows.

This function returns the first n rows of the dataframe. It is useful for quickly verifying data, for example, after sorting or appending rows.

Parameters:

n (int, default = 5) – Number of rows to select.

Returns:

The first n rows of the DataFrame.

Return type:

arkouda.dataframe.DataFrame

See also

tail

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> display(df)

   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9

>>> df.head()

   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4

>>> df.head(n=2)

   col1  col2
0     0     0
1     1    -1

is_registered() bool[source]

Return True if the object is contained in the registry.

Returns:

Indicates if the object is contained in the registry.

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components.

Notes

Objects registered with the server are immune to deletion until they are unregistered.

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
isin(values: arkouda.pdarrayclass.pdarray | Dict | arkouda.series.Series | DataFrame) DataFrame[source]

Determine whether each element in the DataFrame is contained in values.

Parameters:

values (pdarray, dict, Series, or DataFrame) – The values to check for in DataFrame. Series can only have a single index.

Returns:

Arkouda DataFrame of booleans showing whether each element in the DataFrame is contained in values.

Return type:

arkouda.dataframe.DataFrame

See also

ak.Series.isin

Notes

  • Pandas supports values being an iterable type. In arkouda, we replace this with pdarray.

  • Pandas supports ~ operations. Currently, ak.DataFrame does not support this.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col_A': ak.array([7, 3]), 'col_B':ak.array([1, 9])})
>>> display(df)

   col_A  col_B
0      7      1
1      3      9

When values is a pdarray, check every value in the DataFrame to determine if it exists in values.

>>> df.isin(ak.array([0, 1]))

   col_A  col_B
0      0      1
1      0      0

When values is a dict, the values in the dict are passed to check the column indicated by the key.

>>> df.isin({'col_A': ak.array([0, 3])})

   col_A  col_B
0      0      0
1      1      0

When values is a Series, each column is checked if values is present positionally. This means that for True to be returned, the indexes must be the same.

>>> i = ak.Index(ak.arange(2))
>>> s = ak.Series(data=[3, 9], index=i)
>>> df.isin(s)

   col_A  col_B
0      0      0
1      0      1

When values is a DataFrame, the index and column must match. Note that 9 is not found because the column name does not match.

>>> other_df = ak.DataFrame({'col_A':ak.array([7, 3]), 'col_C':ak.array([0, 9])})
>>> df.isin(other_df)

   col_A  col_B
0      1      0
1      1      0

isna() DataFrame[source]

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. numpy.nan values get mapped to True values. Everything else gets mapped to False values.

Returns:

Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({"A": [np.nan, 2, 2, 3], "B": [3, np.nan, 5, 6],
...          "C": [1, np.nan, 2, np.nan], "D":["a","b","c","d"]})
>>> display(df)

     A    B    C  D
0  nan    3    1  a
1    2  nan  nan  b
2    2    5    2  c
3    3    6  nan  d

>>> df.isna()
       A      B      C      D
0   True  False  False  False
1  False   True   True  False
2  False  False  False  False
3  False  False   True  False (4 rows x 4 columns)
classmethod load(prefix_path, file_format='INFER')[source]

Load a dataframe from file. The file_format parameter is accepted for consistency with other load functions.

Parameters:
  • prefix_path (str) – The prefix path for the data.

  • file_format (string, default = "INFER")

Returns:

A dataframe loaded from the prefix_path.

Return type:

arkouda.dataframe.DataFrame

Examples

To store data in <my_dir>/my_data_LOCALE0000, use “<my_dir>/my_data” as the prefix.

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf5_output','my_data')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> df.save(my_path, file_type="distribute")
>>> df.load(my_path)

   A   B
0  0   0
1  1  -1
2  2  -2
3  3  -3
4  4  -4

memory_usage(index=True, unit='B') arkouda.series.Series[source]

Return the memory usage of each column in bytes.

The memory usage can optionally include the contribution of the index.

Parameters:
  • index (bool, default True) – Specifies whether to include the memory usage of the DataFrame’s index in returned Series. If index=True, the memory usage of the index is the first item in the output.

  • unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.

Returns:

A Series whose index is the original column names and whose values are the memory usage of each column in the requested unit (bytes by default).

Return type:

Series

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> dtypes = [ak.int64, ak.float64,  ak.bool]
>>> data = dict([(str(t), ak.ones(5000, dtype=ak.int64).astype(t)) for t in dtypes])
>>> df = ak.DataFrame(data)
>>> display(df.head())

   int64  float64  bool
0      1        1  True
1      1        1  True
2      1        1  True
3      1        1  True
4      1        1  True

>>> df.memory_usage()

             0
Index    40000
int64    40000
float64  40000
bool      5000

>>> df.memory_usage(index=False)

             0
int64    40000
float64  40000
bool      5000

>>> df.memory_usage(unit="KB")

               0
Index    39.0625
int64    39.0625
float64  39.0625
bool     4.88281

To get the approximate total memory usage:

>>> df.memory_usage(index=True).sum()
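
The figures in the examples above follow from element sizes and row counts. The sketch below reproduces that arithmetic in plain Python under two assumptions inferred from the example output: units are powers of 1024 (40000 B is shown as 39.0625 KB), and the default index costs 8 bytes per row. The function and the lookup tables are hypothetical, not part of arkouda.

```python
UNIT_FACTORS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
ITEM_SIZES = {"int64": 8, "float64": 8, "bool": 1}  # bytes per element


def memory_usage(dtypes, n_rows, index=True, unit="B"):
    """Estimate per-column memory use for fixed-width columns."""
    factor = UNIT_FACTORS[unit]
    usage = {}
    if index:
        # Assumption: a default integer index stored as int64.
        usage["Index"] = 8 * n_rows / factor
    for name, dtype in dtypes.items():
        usage[name] = ITEM_SIZES[dtype] * n_rows / factor
    return usage


dtypes = {"int64": "int64", "float64": "float64", "bool": "bool"}
usage_b = memory_usage(dtypes, n_rows=5000)
usage_kb = memory_usage(dtypes, n_rows=5000, unit="KB")
total_b = sum(usage_b.values())
```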
memory_usage_info(unit='GB')[source]

A formatted string representation of the size of this DataFrame.

Parameters:

unit (str, default = "GB") – Unit to return. One of {‘KB’, ‘MB’, ‘GB’}.

Returns:

A string representation of the number of bytes used by this DataFrame in [unit]s.

Return type:

str

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(1000), 'col2': ak.arange(1000)})
>>> df.memory_usage_info()
'0.00 GB'
>>> df.memory_usage_info(unit="KB")
'15 KB'
merge(right: DataFrame, on: str | List[str] | None = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) DataFrame[source]

Merge Arkouda DataFrames with a database-style join. The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters).

Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

Parameters:
  • right (DataFrame) – The Right DataFrame to be joined.

  • on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames.

  • how ({"inner", "left", "right"}, default = "inner") – The merge condition. Must be “inner”, “left”, or “right”.

  • left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”.

  • right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”.

  • convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = “inner”.

  • sort (bool = True) – If True, DataFrame is returned sorted by “on”. Otherwise, the DataFrame is not sorted.

Returns:

Joined Arkouda DataFrame.

Return type:

arkouda.dataframe.DataFrame

Note

Multiple column joins are only supported for integer columns.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)})
>>> display(left_df)

   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4

>>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)})
>>> display(right_df)

   col1  col2
0     0     0
1     2     2
2     4     4
3     6     6
4     8     8

>>> left_df.merge(right_df, on = "col1")

   col1  col2_x  col2_y
0     0       0       0
1     2      -2       2
2     4      -4       4

>>> left_df.merge(right_df, on = "col1", how = "left")

   col1  col2_y  col2_x
0     0       0       0
1     1     nan      -1
2     2       2      -2
3     3     nan      -3
4     4       4      -4

>>> left_df.merge(right_df, on = "col1", how = "right")

   col1  col2_x  col2_y
0     0       0       0
1     2      -2       2
2     4      -4       4
3     6     nan       6
4     8     nan       8

>>> left_df.merge(right_df, on = "col1", how = "outer")

   col1  col2_y  col2_x
0     0       0       0
1     1     nan      -1
2     2       2      -2
3     3     nan      -3
4     4       4      -4
5     6       6     nan
6     8       8     nan

notna() DataFrame[source]

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. numpy.NaN values get mapped to False values.

Returns:

Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.

Return type:

arkouda.dataframe.DataFrame
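
The mask can be illustrated without a server. The sketch below is a plain-Python model of the element-wise test, not Arkouda's implementation; notna_mask is a hypothetical helper, and the data matches the example in this entry.

```python
import math

def notna_mask(column):
    """Return True for each value that is not NaN (hypothetical helper)."""
    return [not (isinstance(v, float) and math.isnan(v)) for v in column]

nan = float("nan")
data = {"A": [nan, 2, 2, 3], "B": [3, nan, 5, 6],
        "C": [1, nan, 2, nan], "D": ["a", "b", "c", "d"]}

# One boolean list per column, the same shape as the input.
mask = {name: notna_mask(col) for name, col in data.items()}
print(mask["A"])  # [False, True, True, True]
```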

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({"A": [np.nan, 2, 2, 3], "B": [3, np.nan, 5, 6],
...          "C": [1, np.nan, 2, np.nan], "D":["a","b","c","d"]})
>>> display(df)

     A    B    C  D
0  nan    3    1  a
1    2  nan  nan  b
2    2    5    2  c
3    3    6  nan  d

>>> df.notna()
       A      B      C     D
0  False   True   True  True
1   True  False  False  True
2   True   True   True  True
3   True   True  False  True

(4 rows x 4 columns)

classmethod read_csv(filename: str, col_delim: str = ',')[source]

Read the columns of a CSV file into an Arkouda DataFrame. If the file contains an appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as Strings objects.

Parameters:
  • filename (str) – Filename to read data from.

  • col_delim (str, default=",") – The delimiter for columns within the data.

Returns:

Arkouda DataFrame containing the columns from the CSV file.

Return type:

arkouda.dataframe.DataFrame

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server.

See also

to_csv

Notes

  • CSV format is not currently supported by load/load_all operations.

  • The column delimiter is expected to be the same for column names and data.

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (”\n”) at this time.

  • Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'csv_output','my_data')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_csv(my_path)
>>> df2 = ak.DataFrame.read_csv(my_path + "_LOCALE0000")
>>> display(df2)

   A  B
0  1  3
1  2  4

register(user_defined_name: str) DataFrame[source]

Register this DataFrame object and underlying components with the Arkouda server.

Parameters:

user_defined_name (str) – User defined name the DataFrame is to be registered under. This will be the root name for underlying components.

Returns:

The same DataFrame, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Note that you cannot register two different DataFrames with the same name.

Return type:

arkouda.dataframe.DataFrame

Raises:
  • TypeError – Raised if user_defined_name is not a str.

  • RegistrationError – If the server was unable to register the DataFrame with the user_defined_name.

Notes

Objects registered with the server are immune to deletion until they are unregistered.

Any changes made to a DataFrame object after registering with the server may not be reflected in attached copies.

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
rename(mapper: Callable | Dict | None = None, index: Callable | Dict | None = None, column: Callable | Dict | None = None, axis: str | int = 0, inplace: bool = False) DataFrame | None[source]

Rename indexes or columns according to a mapping.

Parameters:
  • mapper (callable or dict-like, Optional) – Function or dictionary mapping existing values to new values. Nonexistent names will not raise an error. The value of axis determines whether columns or index values are renamed.

  • column (callable or dict-like, Optional) – Function or dictionary mapping existing column names to new column names. Nonexistent names will not raise an error. When this is set, axis is ignored.

  • index (callable or dict-like, Optional) – Function or dictionary mapping existing index names to new index names. Nonexistent names will not raise an error. When this is set, axis is ignored.

  • axis (int or str, default=0) – The axis on which to perform the rename: 0 or “index” renames index values; 1 or “column” renames columns.

  • inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DataFrame when inplace=False; None when inplace=True.

Return type:

arkouda.dataframe.DataFrame or None
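
The mapping rules described above (a callable or a dictionary, with nonexistent names ignored) can be sketched in plain Python. This models the documented semantics only; apply_mapper is a hypothetical helper and not part of Arkouda.

```python
def apply_mapper(labels, mapper):
    """Rename labels with a callable or a dict (hypothetical helper)."""
    if callable(mapper):
        return [mapper(label) for label in labels]
    # Labels missing from the dict are kept; extra dict keys are ignored.
    return [mapper.get(label, label) for label in labels]

columns = apply_mapper(["A", "B"], {"A": "a", "B": "c", "Z": "z"})
index = apply_mapper([0, 1, 2], {0: 99, 2: 11})
lowered = apply_mapper(["A", "B"], str.lower)
print(columns, index, lowered)
```

The unused key "Z" is silently skipped, which corresponds to "Nonexistent names will not raise an error".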

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> display(df)

   A  B
0  1  4
1  2  5
2  3  6

Rename columns using a mapping:

>>> df.rename(column={'A':'a', 'B':'c'})

   a  c
0  1  4
1  2  5
2  3  6

Rename indexes using a mapping:

>>> df.rename(index={0:99, 2:11})

    A  B
99  1  4
1   2  5
11  3  6

Rename using an axis style parameter:

>>> df.rename(str.lower, axis='column')

   a  b
0  1  4
1  2  5
2  3  6

reset_index(size: int | None = None, inplace: bool = False) None | DataFrame[source]

Set the index to an integer range.

Useful if this dataframe is the result of a slice operation from another dataframe, or if you have permuted the rows and no longer need to keep that ordering on the rows.

Parameters:
  • size (int, optional) – If size is passed, do not attempt to determine size based on existing column sizes. Assume caller handles consistency correctly.

  • inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DataFrame when inplace=False; None when inplace=True.

Return type:

arkouda.dataframe.DataFrame or None

Note

Pandas adds a column ‘index’ to indicate the original index. Arkouda does not currently support this behavior.

Example

>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> display(df)

   A  B
0  1  4
1  2  5
2  3  6

>>> perm_df = df[ak.array([0,2,1])]
>>> display(perm_df)

   A  B
0  1  4
1  3  6
2  2  5

>>> perm_df.reset_index()

   A  B
0  1  4
1  3  6
2  2  5

sample(n=5)[source]

Return a random sample of n rows.

Parameters:

n (int, default=5) – Number of rows to return.

Returns:

The sampled n rows of the DataFrame.

Return type:

arkouda.dataframe.DataFrame

Example

>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> display(df)

   A  B
0  0  0
1  1 -1
2  2 -2
3  3 -3
4  4 -4

Random output of size 3:

>>> df.sample(n=3)

   A  B
0  0  0
1  1 -1
2  4 -4

save(path, index=False, columns=None, file_format='HDF5', file_type='distribute', compression: str | None = None)[source]

DEPRECATED: Save DataFrame to disk, preserving column names.

Parameters:
  • path (str) – File path to save data.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (list, default=None) – List of columns to include in the file. If None, writes out all columns.

  • file_format (str, default='HDF5') – ‘HDF5’ or ‘Parquet’. Defaults to ‘HDF5’

  • file_type (str, default='distribute') – “single” or “distribute”. If “single”, will write a single file to locale 0.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Compression type. Only used for Parquet

Notes

This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.

See also

to_parquet, to_hdf

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf5_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> df.save(my_path + '/my_data', file_type="single")
>>> df.load(my_path + '/my_data')

   A  B
0  0  0
1  1 -1
2  2 -2
3  3 -3
4  4 -4

sort_index(ascending=True)[source]

Sort the DataFrame by indexed columns.

Note: Fails to sort arkouda.strings.Strings columns correctly when multiple columns are being sorted.

Parameters:

ascending (bool, default = True) – Sort values in ascending (default) or descending order.

Example

>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]},
...          index = ak.Index(ak.array([2,0,1]), name="idx"))
>>> display(df)

idx  col1  col2
  2   1.1     6
  0   3.1     5
  1   2.1     4

>>> df.sort_index()

idx  col1  col2
  0   3.1     5
  1   2.1     4
  2   1.1     6

sort_values(by=None, ascending=True)[source]

Sort the DataFrame by one or more columns.

If no column is specified, all columns are used.

Note: Fails to sort arkouda.strings.Strings columns correctly when multiple columns are being sorted.

Parameters:
  • by (str or list/tuple of str, default = None) – The name(s) of the column(s) to sort by.

  • ascending (bool, default = True) – Sort values in ascending (default) or descending order.
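
The rule "if no column is specified, all columns are used" amounts to sorting rows by a tuple of column values. The sketch below models that in plain Python with integer data; sort_permutation is a hypothetical helper, not Arkouda's implementation, and it does not reproduce the Strings limitation noted above.

```python
def sort_permutation(columns, by=None, ascending=True):
    """Return row positions in sorted order (hypothetical helper).

    columns maps column name -> list of values.
    """
    if by is None:
        names = list(columns)          # default: sort by every column, in order
    elif isinstance(by, str):
        names = [by]
    else:
        names = list(by)
    n = len(next(iter(columns.values())))

    def key(i):
        return tuple(columns[name][i] for name in names)

    return sorted(range(n), key=key, reverse=not ascending)

data = {"col1": [2, 2, 1], "col2": [3, 4, 3], "col3": [5, 6, 7]}
perm_all = sort_permutation(data)           # [2, 0, 1]
perm_col3 = sort_permutation(data, "col3")  # [0, 1, 2]
print(perm_all, perm_col3)
```

Applying the returned positions to each column yields the sorted rows.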

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]})
>>> display(df)

   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7

>>> df.sort_values()

   col1  col2  col3
0     1     3     7
1     2     3     5
2     2     4     6

>>> df.sort_values("col3")

   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7

tail(n=5)[source]

Return the last n rows.

This function returns the last n rows for the dataframe. It is useful for quickly testing if your object has the right type of data in it.

Parameters:

n (int, default=5) – Number of rows to select.

Returns:

The last n rows of the DataFrame.

Return type:

arkouda.dataframe.DataFrame

See also

arkouda.dataframe.head

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> display(df)

   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9

>>> df.tail()

   col1  col2
0     5    -5
1     6    -6
2     7    -7
3     8    -8
4     9    -9

>>> df.tail(n=2)

   col1  col2
0     8    -8
1     9    -9

to_csv(path: str, index: bool = False, columns: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)[source]

Writes DataFrame to CSV file(s). File will contain a column for each column in the DataFrame. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

Parameters:
  • path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • index (bool, default=False) – If True, the index of the DataFrame will be written to the file as a column.

  • columns (list of str (Optional)) – Column names to assign when writing data.

  • col_delim (str, default=",") – Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool, default=False) – If True, any existing files matching the provided path prefix will be overwritten. If False, an error will be returned if existing files are found.

Return type:

None

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server.

Notes

  • CSV format is not currently supported by load/load_all operations.

  • The column delimiter is expected to be the same for column names and data.

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (”\n”) at this time.
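
Because the files are written with a _LOCALE#### suffix, reading them back requires the full file name. The sketch below shows how such names can be built; locale_filenames is a hypothetical helper, and the four-digit zero-padded numbering starting at zero is an assumption inferred from the _LOCALE0000 name used in the example for this method.

```python
def locale_filenames(prefix, num_locales):
    """Build per-locale file names (hypothetical helper; 4-digit padding assumed)."""
    return [f"{prefix}_LOCALE{i:04d}" for i in range(num_locales)]

names = locale_filenames("csv_output/my_data", 2)
print(names)  # ['csv_output/my_data_LOCALE0000', 'csv_output/my_data_LOCALE0001']
```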

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'csv_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_csv(my_path + "/my_data")
>>> df2 = ak.DataFrame.read_csv(my_path + "/my_data" + "_LOCALE0000")
>>> display(df2)

   A  B
0  1  3
1  2  4

to_hdf(path, index=False, columns=None, file_type='distribute')[source]

Save DataFrame to disk as hdf5, preserving column names.

Parameters:
  • path (str) – File path to save data.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (List, default = None) – List of columns to include in the file. If None, writes out all columns.

  • file_type (str (single | distribute), default=distribute) – Whether to save to a single file or distribute across Locales.

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray.

Notes

This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.

See also

to_parquet, load

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")

   A  B
0  1  3
1  2  4

to_markdown(mode='wt', index=True, tablefmt='grid', storage_options=None, **kwargs)[source]

Print DataFrame in Markdown-friendly format.

Parameters:
  • mode (str, optional) – Mode in which file is opened, “wt” by default.

  • index (bool, optional, default True) – Add index (row) labels.

  • tablefmt (str = "grid") – Table format to call from tabulate: https://pypi.org/project/tabulate/

  • storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set of allowed keys and values.

  • **kwargs – These parameters will be passed to tabulate.

Note

This function should only be called on small DataFrames as it calls pandas.DataFrame.to_markdown: https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.to_markdown.html

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]})
>>> print(df.to_markdown())
+----+------------+------------+
|    | animal_1   | animal_2   |
+====+============+============+
|  0 | elk        | dog        |
+----+------------+------------+
|  1 | pig        | quetzal    |
+----+------------+------------+

Suppress the index:

>>> print(df.to_markdown(index = False))
+------------+------------+
| animal_1   | animal_2   |
+============+============+
| elk        | dog        |
+------------+------------+
| pig        | quetzal    |
+------------+------------+
to_pandas(datalimit=maxTransferBytes, retain_index=False)[source]

Send this DataFrame to a pandas DataFrame.

Parameters:
  • datalimit (int, default=arkouda.client.maxTransferBytes) – The maximum size, in megabytes, to transfer. The requested DataFrame will be converted to a pandas DataFrame only if the estimated size of the DataFrame does not exceed this value.

  • retain_index (bool, default=False) – Normally, to_pandas() creates a new range index object. If you want to keep the index column, set this to True.

Returns:

The result of converting this DataFrame to a pandas DataFrame.

Return type:

pandas.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> ak_df = ak.DataFrame({"A": ak.arange(2), "B": -1 * ak.arange(2)})
>>> type(ak_df)
arkouda.dataframe.DataFrame
>>> display(ak_df)

   A  B
0  0  0
1  1 -1

>>> import pandas as pd
>>> pd_df = ak_df.to_pandas()
>>> type(pd_df)
pandas.core.frame.DataFrame
>>> display(pd_df)

   A  B
0  0  0
1  1 -1

to_parquet(path, index=False, columns=None, compression: str | None = None, convert_categoricals: bool = False)[source]

Save DataFrame to disk as parquet, preserving column names.

Parameters:
  • path (str) – File path to save data.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (list) – List of columns to include in the file. If None, writes out all columns.

  • compression (str (Optional), default=None) – Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4

  • convert_categoricals (bool, default=False) – Parquet requires all columns to be the same size and Categoricals don’t satisfy that requirement. If set, write the equivalent Strings in place of any Categorical columns.

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.

See also

to_hdf, load

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'parquet_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_parquet(my_path + "/my_data")
>>> df.load(my_path + "/my_data")

   B  A
0  3  1
1  4  2

transfer(hostname, port)[source]

Sends a DataFrame to a different Arkouda server.

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the DataFrame is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). The transfer opens numLocales ports in succession, using the range {port..(port+numLocales-1)} (e.g., when running an Arkouda server of 4 nodes with 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().

Returns:

A message indicating a complete transfer.

Return type:

str
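
When choosing a value for port, it helps to list every port the transfer will occupy. The sketch below reproduces the arithmetic from the port description, one port per locale starting at port; ports_used is a hypothetical helper and not part of Arkouda.

```python
def ports_used(port, num_locales):
    """List the consecutive ports a transfer would occupy (hypothetical helper)."""
    return list(range(port, port + num_locales))

# Four locales starting at port 1234, as in the parameter description.
print(ports_used(1234, 4))  # [1234, 1235, 1236, 1237]
```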

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister()[source]

Unregister this DataFrame object in the arkouda server which was previously registered using register() and/or attached to using attach().

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister.

Notes

Objects registered with the server are immune to deletion until they are unregistered.

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
static unregister_dataframe_by_name(user_defined_name: str) str[source]

Function to unregister DataFrame object by name which was registered with the arkouda server via register().

Parameters:

user_defined_name (str) – Name under which the DataFrame object was registered.

Raises:
  • TypeError – If user_defined_name is not a string.

  • RegistrationError – If there is an issue attempting to unregister any underlying components.

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister_dataframe_by_name("my_table_name")
>>> df.is_registered()
False
update_hdf(prefix_path: str, index=False, columns=None, repack: bool = True)[source]

Overwrite the dataset with the name provided with this dataframe. If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (List, default=None) – List of columns to include in the file. If None, writes out all columns.

  • repack (bool, default=True) – HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Returns:

Success message if successful.

Return type:

str

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray.

Notes

If the file does not contain a File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

If the dataset provided does not exist, it will be added.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")

   A  B
0  1  3
1  2  4

>>> df2 = ak.DataFrame({"A":[5,6],"B":[7,8]})
>>> df2.update_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")

   A  B
0  5  7
1  6  8

update_nrows()[source]

Computes the number of rows on the arkouda server and updates the size parameter.

class arkouda.dataframe.DiffAggregate(gb, series)[source]

A column in a GroupBy that has been differenced. Aggregation operations can be done on the result.

gb

GroupBy object, where the aggregation keys are values of column(s) of a dataframe.

Type:

arkouda.groupbyclass.GroupBy

values

A column to compute the difference on.

Type:

arkouda.series.Series

arkouda.dataframe.intersect(a, b, positions=True, unique=False)[source]

Find the intersection of two arkouda arrays.

This function can be especially useful when positions=True so that the caller gets the indices of values present in both arrays.

Parameters:
  • a (Strings or pdarray) – An array of strings.

  • b (Strings or pdarray) – An array of strings.

  • positions (bool, default=True) – Return tuple of boolean pdarrays that indicate positions in a and b of the intersection values.

  • unique (bool, default=False) – If the number of distinct values in a (and b) is equal to the size of a (and b), there is a more efficient method to compute the intersection.

Returns:

The indices of a and b where any element occurs at least once in both arrays.

Return type:

(arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray) or arkouda.pdarrayclass.pdarray
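
The two return shapes can be modeled in plain Python. This sketch only illustrates the documented behavior with lists and sets; intersect_sketch is a hypothetical helper, it is not how Arkouda computes the result, and returning the common values in sorted order is an assumption.

```python
def intersect_sketch(a, b, positions=True):
    """Model of the two return modes (hypothetical helper)."""
    in_a, in_b = set(a), set(b)
    if positions:
        # One boolean mask per input, marking values that occur in the other.
        return [x in in_b for x in a], [x in in_a for x in b]
    return sorted(in_a & in_b)

a = list(range(10))
b = [2 * x for x in a]
mask_a, mask_b = intersect_sketch(a, b)
common = intersect_sketch(a, b, positions=False)
print(mask_a, mask_b, common)
```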

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.dataframe import intersect
>>> a = ak.arange(10)
>>> print(a)
[0 1 2 3 4 5 6 7 8 9]
>>> b = 2 * ak.arange(10)
>>> print(b)
[0 2 4 6 8 10 12 14 16 18]
>>> intersect(a,b, positions=True)
(array([True False True False True False True False True False]),
array([True True True True True False False False False False]))
>>> intersect(a,b, positions=False)
array([0 2 4 6 8])
arkouda.dataframe.intx(a, b)[source]

Find all the rows that are in both dataframes. Columns should be in identical order.

Note: This does not work for columns of floating point values. It does work for Strings and for pdarrays of int64 type, and it should work for Categorical.
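
Row-wise intersection can be pictured as comparing whole rows as tuples. The sketch below models that idea with plain dicts of lists; intx_sketch is a hypothetical helper rather than Arkouda's implementation, and it relies on both inputs listing their columns in the same order.

```python
def intx_sketch(a, b):
    """Mask of rows in a that also appear in b (hypothetical helper).

    a and b map column name -> list of values, with columns in identical order.
    """
    rows_a = list(zip(*a.values()))
    rows_b = set(zip(*b.values()))
    return [row in rows_b for row in rows_a]

a = {"a": [0, 1, 2, 3, 4], "b": [0, 2, 4, 6, 8]}
b = {"a": [0, 1, 2, 3, 4], "b": [0, 3, 4, 7, 8]}
mask = intx_sketch(a, b)
print(mask)  # [True, False, True, False, True]
```

Indexing the first frame with this mask keeps rows (0, 0), (2, 4), and (4, 8).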

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.dataframe import intx
>>> a = ak.DataFrame({'a':ak.arange(5),'b': 2* ak.arange(5)})
>>> display(a)

   a  b
0  0  0
1  1  2
2  2  4
3  3  6
4  4  8

>>> b = ak.DataFrame({'a':ak.arange(5),'b':ak.array([0,3,4,7,8])})
>>> display(b)

   a  b
0  0  0
1  1  3
2  2  4
3  3  7
4  4  8

>>> intx(a,b)
>>> intersect_df = a[intx(a,b)]
>>> display(intersect_df)

   a  b
0  0  0
1  2  4
2  4  8

arkouda.dataframe.invert_permutation(perm)[source]

Find the inverse of a permutation array.

Parameters:

perm (pdarray) – The permutation array.

Returns:

The inverse of the permutation array.

Return type:

arkouda.pdarrayclass.pdarray
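
The inverse of a permutation sends each target position back to where it came from. The sketch below shows the idea on a plain list; invert_permutation_sketch is a hypothetical helper, not Arkouda's server-side implementation.

```python
def invert_permutation_sketch(perm):
    """Return inv such that inv[perm[i]] == i for every i (hypothetical helper)."""
    inverse = [0] * len(perm)
    for position, target in enumerate(perm):
        inverse[target] = position
    return inverse

perm = [2, 0, 1, 4, 3]
inverse = invert_permutation_sketch(perm)
print(inverse)  # [1, 2, 0, 4, 3]
```

Applying perm and then inverse (or the other way round) restores the original order.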

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.index import Index
>>> from arkouda.dataframe import invert_permutation
>>> i = Index(ak.array([1,2,0,5,4]))
>>> perm = i.argsort()
>>> print(perm)
[2 0 1 4 3]
>>> invert_permutation(perm)
array([1 2 0 4 3])
arkouda.dataframe.merge(left: DataFrame, right: DataFrame, on: str | List[str] | None = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) DataFrame[source]

Merge Arkouda DataFrames with a database-style join. The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters).

Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

Parameters:
  • left (DataFrame) – The Left DataFrame to be joined.

  • right (DataFrame) – The Right DataFrame to be joined.

  • on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames.

  • how (str, default = "inner") – The merge condition. Must be one of “inner”, “left”, “right”, or “outer”.

  • left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”.

  • right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”.

  • convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = “inner”.

  • sort (bool = True) – If True, DataFrame is returned sorted by “on”. Otherwise, the DataFrame is not sorted.

Returns:

Joined Arkouda DataFrame.

Return type:

arkouda.dataframe.DataFrame

Note

Multiple column joins are only supported for integer columns.
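
The default for on and the role of the suffixes can be sketched in plain Python. This models only how result column names could be derived; merged_columns is a hypothetical helper, and the column order of an actual merge result may differ from what it returns.

```python
def merged_columns(left_cols, right_cols, on=None,
                   left_suffix="_x", right_suffix="_y"):
    """Derive result column names for a join (hypothetical helper)."""
    if on is None:
        # Default: join on every column name shared by both frames.
        on = [c for c in left_cols if c in right_cols]
    elif isinstance(on, str):
        on = [on]
    # Shared columns that are not join keys need a suffix to stay distinct.
    overlap = {c for c in left_cols if c in right_cols} - set(on)
    out = list(on)
    out += [c + left_suffix if c in overlap else c
            for c in left_cols if c not in on]
    out += [c + right_suffix if c in overlap else c
            for c in right_cols if c not in on]
    return out

cols = merged_columns(["col1", "col2"], ["col1", "col2"], on="col1")
print(cols)  # ['col1', 'col2_x', 'col2_y']
```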

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda import merge
>>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)})
>>> display(left_df)

   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4

>>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)})
>>> display(right_df)

   col1  col2
0     0     0
1     2     2
2     4     4
3     6     6
4     8     8

>>> merge(left_df, right_df, on = "col1")

   col1  col2_x  col2_y
0     0       0       0
1     2      -2       2
2     4      -4       4

>>> merge(left_df, right_df, on = "col1", how = "left")

   col1  col2_y  col2_x
0     0       0       0
1     1     nan      -1
2     2       2      -2
3     3     nan      -3
4     4       4      -4

>>> merge(left_df, right_df, on = "col1", how = "right")

   col1  col2_x  col2_y
0     0       0       0
1     2      -2       2
2     4      -4       4
3     6     nan       6
4     8     nan       8

>>> merge(left_df, right_df, on = "col1", how = "outer")

   col1  col2_y  col2_x
0     0       0       0
1     1     nan      -1
2     2       2      -2
3     3     nan      -3
4     4       4      -4
5     6       6     nan
6     8       8     nan