Examples

Arkouda Arrays

Arkouda arrays function similarly to NumPy arrays but are designed to operate at a much larger scale. In Arkouda, arrays are called pdarray objects. A pdarray can be created from a Python list, from a NumPy ndarray, or with a generator function similar to those found in NumPy. This document introduces some of the most commonly used elements of Arkouda; it is not an exhaustive description of functionality.

pdarray Creation

Like ndarray objects in NumPy, Arkouda pdarray objects can be generated from a Python list.

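# import Arkouda and connect to a running arkouda_server
# (with no arguments, ak.connect() uses its default host and port)
>> import arkouda as ak
>> ak.connect()

# create the Python list
>> l = [0, 1, 2, 3, 4]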

# generate a pdarray
>> ak_arr = ak.array(l)
>> ak_arr
array([0 1 2 3 4])

pdarray objects can also be generated directly from an ndarray, which makes it easy to move data from NumPy into Arkouda.

# create an ndarray
>> import numpy as np
>> np_arr = np.array([0, 1, 2, 3, 4])

# generate a pdarray
>> ak_arr = ak.array(np_arr)
>> ak_arr
array([0 1 2 3 4])

pdarray objects can also be created with generator functions such as arange and randint.

# arange
>> ak_arr = ak.arange(10)
>> ak_arr
array([0 1 2 3 4 5 6 7 8 9])

# randint(low, high, size)
>> r = ak.randint(0, 100, 10)
>> r # output will vary
array([52 84 1 52 80 71 27 20 7 7])

Exporting pdarray Objects

Arkouda lets users export pdarray objects to other formats to aid in moving between toolsets. A pdarray can be exported to a NumPy ndarray or a Python list.

# create pdarray
>> ak_arr = ak.array([0, 1, 2, 3, 4])

# export to ndarray
>> np_arr = ak_arr.to_ndarray()
>> np_arr
array([0, 1, 2, 3, 4])

# export to a Python List
>> l = ak_arr.to_list()
>> l
[0, 1, 2, 3, 4]

pdarray Set Operations

Like NumPy, Arkouda supports set operations on pdarray objects. The supported set operations are:

  • IN (in1d) : Test whether each element of a 1-D array is also present in a second array.

  • UNION (union1d) : Compute the unique union of the arrays.

  • INTERSECT (intersect1d) : Compute the unique intersection of the arrays.

  • SET DIFFERENCE (setdiff1d) : Compute the unique values in the first array that are not in the second.

  • SYMMETRIC DIFFERENCE (setxor1d) : Compute the exclusive-or of the two arrays, that is, the unique values found in exactly one of them.
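As a rough illustration of what union1d, setdiff1d, and setxor1d compute, the sketch below models them with plain Python sets. It is a reference model only, not Arkouda code: the helper functions are hypothetical stand-ins that borrow the Arkouda names, and it assumes the results are returned as sorted unique values, as in NumPy.

```python
# Plain-Python reference model (not Arkouda code). Each helper returns the
# sorted list of unique values, assuming NumPy-style set semantics.
a = [4, 2, 5, 6, 4, 7, 2]
b = [1, 5, 4, 11, 9, 6]


def union1d(x, y):
    """Unique values found in either input."""
    return sorted(set(x) | set(y))


def setdiff1d(x, y):
    """Unique values of x that are not in y."""
    return sorted(set(x) - set(y))


def setxor1d(x, y):
    """Unique values found in exactly one of the inputs."""
    return sorted(set(x) ^ set(y))


union_ab = union1d(a, b)   # [1, 2, 4, 5, 6, 7, 9, 11]
diff_ab = setdiff1d(a, b)  # [2, 7]
xor_ab = setxor1d(a, b)    # [1, 2, 7, 9, 11]
```

Note that setdiff1d is not symmetric: swapping the arguments gives the values of b that are missing from a instead.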

Arkouda extends this functionality beyond a single dimension: these operations can also be performed on lists of pdarrays. The code block below demonstrates in1d and intersect1d, first on single arrays and then on multiple arrays.

# configure 2 pdarrays to run against
>> a = ak.array([4, 2, 5, 6, 4, 7, 2])
>> b = ak.array([1, 5, 4, 11, 9, 6])

# compute a boolean array indicating which values of a are found in b
>> ak_in1d = ak.in1d(a, b)
>> ak_in1d
array([True False True True True False False])

# compute the array of unique values found in both a and b
>> ak_int = ak.intersect1d(a, b)
>> ak_int
array([4 5 6])

# Arkouda can perform these operations on multiple arrays at once
>> m1 = [
    ak.array([0, 1, 3, 4, 8, 5, 0]),
    ak.array([0, 9, 5, 1, 8, 5, 0])
]
>> m2 = [
    ak.array([0, 1, 3, 4, 8, 7]),
    ak.array([0, 2, 5, 9, 8, 5])
]

>> ak_in1dmult = ak.in1d(m1, m2)
>> ak_in1dmult
array([True False True False True False True])

>> ak_intmult = ak.intersect1d(m1, m2)
>> ak_intmult
[array([0 3 8]), array([0 5 8])]

There are a few things to keep in mind when working with multiple arrays. First, m1 and m2 must be Python lists containing the same number of pdarray elements. Second, the values at each position are treated as a tuple. In the example above, the first value of m1 is viewed as (0, 0) during computation.
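The tuple interpretation can be sketched in plain Python. This is a reference model of the semantics only, not Arkouda code: it uses ordinary lists in place of pdarrays and assumes that the intersection is returned in sorted tuple order.

```python
# Plain-Python model of the multi-array case (not Arkouda code).
m1 = [[0, 1, 3, 4, 8, 5, 0], [0, 9, 5, 1, 8, 5, 0]]
m2 = [[0, 1, 3, 4, 8, 7], [0, 2, 5, 9, 8, 5]]

# Pair up the values at each position: (0, 0), (1, 9), (3, 5), ...
rows1 = list(zip(*m1))
rows2 = list(zip(*m2))

# in1d: for each tuple of m1, is the same tuple present in m2?
lookup = set(rows2)
in1d_multi = [row in lookup for row in rows1]

# intersect1d: unique tuples common to both inputs, sorted, then split
# back into one list per input array.
common = sorted(set(rows1) & lookup)
intersect_multi = [list(col) for col in zip(*common)]
```

Here in1d_multi matches the boolean result shown above, and intersect_multi holds the tuples (0, 0), (3, 5), and (8, 8) split back into two lists.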

Arkouda DataFrames

As in Pandas, Arkouda supports the DataFrame construct. The structure of these objects is very similar, though some functionality may vary. DataFrames are extremely useful when working with multiple related pdarray objects. In Arkouda, a DataFrame consists of an Index (which uses an Arkouda.Index), Column Names, and Column Data.

Creating & Using a DataFrame

Let’s take a look at creating a DataFrame in Arkouda. As with pdarrays, there is more than one way to do so:

  • Importing a Pandas DataFrame

  • Passing a Python mapping {column_name: column_data}. Each column_data must be a pdarray; each column_name is used by the constructor to set the column names of the DataFrame.

The most important thing to remember is that each column of an Arkouda DataFrame is a pdarray and must be provided as such. The only exception is importing a Pandas DataFrame, in which case the constructor generates the pdarray objects for you from the columns of the Pandas DataFrame.

Importing Pandas DataFrame

# construct the Pandas DataFrame
>> import pandas as pd
>> fname = ['John', 'Jane', 'John', 'Jake']
>> lname = ['Doe', 'Doe', 'Smith', 'Brown']
>> age = [37, 35, 50, 32]
>> salary = [75000, 77000, 100000, 35000]
>> pd_df = pd.DataFrame({
    'F_Name': fname,
    'L_Name': lname,
    'Age': age,
    'Salary': salary
})
>> pd_df
    F_Name L_Name  Age  Salary
0   John    Doe   37   75000
1   Jane    Doe   35   77000
2   John  Smith   50  100000
3   Jake  Brown   32   35000

# call the Arkouda DataFrame constructor
>> df = ak.DataFrame(pd_df)
>> df
    F_Name L_Name  Age  Salary
0   John    Doe   37   75000
1   Jane    Doe   35   77000
2   John  Smith   50  100000
3   Jake  Brown   32   35000 (4 rows x 4 columns)

Python Mapping

>> fname = ak.array(['John', 'Jane', 'John', 'Jake'])
>> lname = ak.array(['Doe', 'Doe', 'Smith', 'Brown'])
>> age = ak.array([37, 35, 50, 32])
>> salary = ak.array([75000, 77000, 100000, 35000])
>> df = ak.DataFrame({
    'F_Name': fname,
    'L_Name': lname,
    'Age': age,
    'Salary': salary
})

>> df
    F_Name L_Name  Age  Salary
0   John    Doe   37   75000
1   Jane    Doe   35   77000
2   John  Smith   50  100000
3   Jake  Brown   32   35000 (4 rows x 4 columns)

Note: The Arkouda DataFrame constructor takes nearly the same information as the Pandas constructor, with one key difference: each column is an Arkouda pdarray.

Basic Interaction

Note: This section uses the same DataFrame generated in the creation demos.

This section highlights some of the basics of DataFrame interaction in Arkouda, which is very similar to interacting with a Pandas DataFrame.

# display the DataFrame created earlier for reference
>> df
    F_Name L_Name  Age  Salary
0   John    Doe   37   75000
1   Jane    Doe   35   77000
2   John  Smith   50  100000
3   Jake  Brown   32   35000 (4 rows x 4 columns)

# accessing a column
>> df['Age']
array([37 35 50 32])

# accessing multiple columns at once
>> df['L_Name', 'Age'] # equivalent to df[['L_Name', 'Age']]
    L_Name  Age
0    Doe   37
1    Doe   35
2  Smith   50
3  Brown   32 (4 rows x 2 columns)

# accessing a row
>> df[0]
{'F_Name': 'John', 'L_Name': 'Doe', 'Age': 37, 'Salary': 75000}

# accessing a row slice
>> df[0:2]
    F_Name L_Name  Age  Salary
0   John    Doe   37   75000
1   Jane    Doe   35   77000 (2 rows x 4 columns)

# accessing multiple rows by index
>> idx = ak.array([0, 2, 3])
>> df[idx]
    F_Name L_Name  Age  Salary
0   John    Doe   37   75000
2   John  Smith   50  100000
3   Jake  Brown   32   35000 (3 rows x 4 columns)
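The row-access results above follow naturally from the column-oriented layout. The sketch below is a plain-Python model, not Arkouda's implementation: a dict of lists stands in for the DataFrame, and get_row and take_rows are hypothetical helpers introduced only for illustration.

```python
# Plain-Python model of a column-oriented table (not Arkouda code).
columns = {
    'F_Name': ['John', 'Jane', 'John', 'Jake'],
    'L_Name': ['Doe', 'Doe', 'Smith', 'Brown'],
    'Age': [37, 35, 50, 32],
    'Salary': [75000, 77000, 100000, 35000],
}


def get_row(cols, i):
    """Read position i from every column and return the row as a dict."""
    return {name: data[i] for name, data in cols.items()}


def take_rows(cols, idx):
    """Select the positions in idx from every column."""
    return {name: [data[i] for i in idx] for name, data in cols.items()}


row0 = get_row(columns, 0)
subset = take_rows(columns, [0, 2, 3])
```

In this model, row0 is the same dictionary that df[0] displays, and subset keeps rows 0, 2, and 3 of every column, mirroring df[idx].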

Exporting to Pandas

An Arkouda DataFrame can be exported to Pandas with the to_pandas function.

# display the DataFrame created earlier for reference
>> df
    F_Name L_Name  Age  Salary
0   John    Doe   37   75000
1   Jane    Doe   35   77000
2   John  Smith   50  100000
3   Jake  Brown   32   35000 (4 rows x 4 columns)

>> pd_df = df.to_pandas()
>> pd_df
    F_Name L_Name  Age  Salary
0   John    Doe   37   75000
1   Jane    Doe   35   77000
2   John  Smith   50  100000
3   Jake  Brown   32   35000

GroupBy

In Pandas, groupby-aggregate is a very useful pattern that can be computationally intensive. Arkouda supports grouping by key and most of the aggregations available in Pandas. GroupBy functionality in Arkouda is supported on pdarray and DataFrame objects.

pdarrays

# use randint for more interesting results; values will vary
>> x = ak.randint(0, 10, 100)
>> g = ak.GroupBy(x)
>> g.count()
(array([0 1 2 3 4 5 6 7 8 9]), array([14 5 8 17 14 8 5 9 11 9]))
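The result of count() is a pair: the unique keys and the number of occurrences of each key. A minimal plain-Python model of that shape follows; it is not Arkouda code, groupby_count is a hypothetical helper, and the sorted key order is an assumption based on the output above.

```python
from collections import Counter


def groupby_count(values):
    """Return (unique keys in sorted order, count of each key)."""
    counts = Counter(values)
    keys = sorted(counts)
    return keys, [counts[k] for k in keys]


# Fixed input instead of randint so the result is reproducible.
x = [3, 1, 3, 0, 1, 3]
keys, counts = groupby_count(x)
# keys is [0, 1, 3] and counts is [1, 2, 3]
```

The two returned lists line up position by position, so counts[i] is the number of times keys[i] appears in x.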

DataFrames

# display the DataFrame created earlier for reference
>> df
    F_Name L_Name  Age  Salary
0   John    Doe   37   75000
1   Jane    Doe   35   77000
2   John  Smith   50  100000
3   Jake  Brown   32   35000 (4 rows x 4 columns)

>> g = df.groupby("L_Name")
>> g.count()
Doe      2
Brown    1
Smith    1
dtype: int64
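To make the groupby-aggregate pattern concrete, the sketch below groups the sample data by last name in plain Python and computes a count and a sum per group. It models the pattern only and is not Arkouda's API; the variable names are illustrative.

```python
from collections import defaultdict

# Two columns from the sample DataFrame, as plain Python lists.
lname = ['Doe', 'Doe', 'Smith', 'Brown']
salary = [75000, 77000, 100000, 35000]

# Group: collect the salary values that share the same key.
groups = defaultdict(list)
for key, value in zip(lname, salary):
    groups[key].append(value)

# Aggregate: reduce each group to a single value.
count_by_name = {k: len(v) for k, v in groups.items()}
total_salary = {k: sum(v) for k, v in groups.items()}
```

The counts agree with the g.count() output above (Doe 2, Brown 1, Smith 1), and total_salary shows the same grouping feeding a different aggregation.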