arkouda.dataframe ================= .. py:module:: arkouda.dataframe Classes ------- .. autoapisummary:: arkouda.dataframe.DataFrame arkouda.dataframe.DataFrameGroupBy arkouda.dataframe.DiffAggregate Functions --------- .. autoapisummary:: arkouda.dataframe.intersect arkouda.dataframe.intx arkouda.dataframe.invert_permutation arkouda.dataframe.merge Module Contents --------------- .. py:class:: DataFrame(dict=None, /, **kwargs) Bases: :py:obj:`collections.UserDict` A DataFrame structure based on arkouda arrays. :param initialdata: Each list/dictionary entry corresponds to one column of the data and should be a homogeneous type. Different columns may have different types. If using a dictionary, keys should be strings. :type initialdata: List or dictionary of lists, tuples, or pdarrays :param index: Index for the resulting frame. Defaults to an integer range. :type index: Index, pdarray, or Strings :param columns: Column labels to use if the data does not include them. Elements must be strings. Defaults to a stringified integer range. :type columns: List, tuple, pdarray, or Strings .. rubric:: Examples Create an empty DataFrame and add a column of data: >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame() >>> df['a'] = ak.array([1,2,3]) >>> display(df) +----+-----+ | | a | +====+=====+ | 0 | 1 | +----+-----+ | 1 | 2 | +----+-----+ | 2 | 3 | +----+-----+ Create a new DataFrame using a dictionary of data: >>> userName = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice']) >>> userID = ak.array([111, 222, 111, 333, 222, 111]) >>> item = ak.array([0, 0, 1, 1, 2, 0]) >>> day = ak.array([5, 5, 6, 5, 6, 6]) >>> amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6]) >>> df = ak.DataFrame({'userName': userName, 'userID': userID, ... 'item': item, 'day': day, 'amount': amount}) >>> display(df) +----+------------+----------+--------+-------+----------+ | | userName | userID | item | day | amount | +====+============+==========+========+=======+==========+ | 0 | Alice | 111 | 0 | 5 | 0.5 | +----+------------+----------+--------+-------+----------+ | 1 | Bob | 222 | 0 | 5 | 0.6 | +----+------------+----------+--------+-------+----------+ | 2 | Alice | 111 | 1 | 6 | 1.1 | +----+------------+----------+--------+-------+----------+ | 3 | Carol | 333 | 1 | 5 | 1.2 | +----+------------+----------+--------+-------+----------+ | 4 | Bob | 222 | 2 | 6 | 4.3 | +----+------------+----------+--------+-------+----------+ | 5 | Alice | 111 | 0 | 6 | 0.6 | +----+------------+----------+--------+-------+----------+ Indexing works slightly differently than with pandas: >>> df[0] +------------+----------+ | keys | values | +============+==========+ | userName | Alice | +------------+----------+ | userID | 111 | +------------+----------+ | item | 0 | +------------+----------+ | day | 5 | +------------+----------+ | amount | 0.5 | +------------+----------+ >>> df['userID'] array([111, 222, 111, 333, 222, 111]) >>> df['userName'] array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice']) >>> df[ak.array([1,3,5])] +----+------------+----------+--------+-------+----------+ | | userName | userID | item | day | amount | +====+============+==========+========+=======+==========+ | 0 | Bob | 222 | 0 | 5 | 0.6 | +----+------------+----------+--------+-------+----------+ | 1 | Carol | 333 | 1 | 5 | 1.2 | +----+------------+----------+--------+-------+----------+ | 2 | Alice | 111 | 0 | 6 | 0.6 | +----+------------+----------+--------+-------+----------+
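Rows can also be selected with a boolean mask, just as the filter_by_range example later in this module does; a minimal sketch using the frame above (assuming a column comparison produces a boolean pdarray, as for any arkouda array): >>> df[df['userID'] == 111] +----+------------+----------+--------+-------+----------+ | | userName | userID | item | day | amount | +====+============+==========+========+=======+==========+ | 0 | Alice | 111 | 0 | 5 | 0.5 | +----+------------+----------+--------+-------+----------+ | 1 | Alice | 111 | 1 | 6 | 1.1 | +----+------------+----------+--------+-------+----------+ | 2 | Alice | 111 | 0 | 6 | 0.6 | +----+------------+----------+--------+-------+----------+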
Slice rows with a stride: >>> df[1:5:1] +----+------------+----------+--------+-------+----------+ | | userName | userID | item | day | amount | +====+============+==========+========+=======+==========+ | 0 | Bob | 222 | 0 | 5 | 0.6 | +----+------------+----------+--------+-------+----------+ | 1 | Alice | 111 | 1 | 6 | 1.1 | +----+------------+----------+--------+-------+----------+ | 2 | Carol | 333 | 1 | 5 | 1.2 | +----+------------+----------+--------+-------+----------+ | 3 | Bob | 222 | 2 | 6 | 4.3 | +----+------------+----------+--------+-------+----------+ >>> df[ak.array([1,2,3])] +----+------------+----------+--------+-------+----------+ | | userName | userID | item | day | amount | +====+============+==========+========+=======+==========+ | 0 | Bob | 222 | 0 | 5 | 0.6 | +----+------------+----------+--------+-------+----------+ | 1 | Alice | 111 | 1 | 6 | 1.1 | +----+------------+----------+--------+-------+----------+ | 2 | Carol | 333 | 1 | 5 | 1.2 | +----+------------+----------+--------+-------+----------+ >>> df[['userID', 'day']] +----+----------+-------+ | | userID | day | +====+==========+=======+ | 0 | 111 | 5 | +----+----------+-------+ | 1 | 222 | 5 | +----+----------+-------+ | 2 | 111 | 6 | +----+----------+-------+ | 3 | 333 | 5 | +----+----------+-------+ | 4 | 222 | 6 | +----+----------+-------+ | 5 | 111 | 6 | +----+----------+-------+ .. py:method:: GroupBy(keys, use_series=False, as_index=True, dropna=True) Group the dataframe by a column or a list of columns. :param keys: An (ordered) list of column names or a single string to group by. :type keys: str or list of str :param use_series: If True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise an arkouda.groupbyclass.GroupBy object. :type use_series: bool, default=False :param as_index: If True, groupby columns will be set as index; otherwise, the groupby columns will be treated as DataFrame columns. :type as_index: bool, default=True :param dropna: If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept. :type dropna: bool, default=True :returns: If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise returns an arkouda.groupbyclass.GroupBy object. :rtype: arkouda.dataframe.DataFrameGroupBy or arkouda.groupbyclass.GroupBy .. seealso:: :obj:`arkouda.GroupBy` .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import numpy as np >>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]}) >>> df +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 4 | +----+--------+--------+ | 1 | 1 | 5 | +----+--------+--------+ | 2 | 2 | 6 | +----+--------+--------+ | 3 | nan | 7 | +----+--------+--------+ >>> df.GroupBy("col1") >>> df.GroupBy("col1").size() (array([1.00000000000000000 2.00000000000000000]), array([2 1])) >>> df.GroupBy("col1", use_series=True).size() col1 1.0 2 2.0 1 dtype: int64 >>> df.GroupBy("col1", use_series=True, as_index = False).size() +----+--------+--------+ | | col1 | size | +====+========+========+ | 0 | 1 | 2 | +----+--------+--------+ | 1 | 2 | 1 | +----+--------+--------+ .. py:method:: all(axis=0) -> Union[Series, bool] Return whether all elements are True, potentially over an axis. Returns True unless there is at least one element along a DataFrame axis that is False. Currently, will ignore any columns that are not type bool. This is equivalent to the pandas option bool_only=True. :param axis: Indicate which axis or axes should be reduced.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels. 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index. None : reduce all axes, return a scalar. :type axis: {0 or ‘index’, 1 or ‘columns’, None}, default = 0 :rtype: arkouda.series.Series or bool :raises ValueError: Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}. .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({"A":[True,True,True,False],"B":[True,True,True,False], ... "C":[True,False,True,False],"D":[True,True,True,True]}) +----+---------+---------+---------+--------+ | | A | B | C | D | +====+=========+=========+=========+========+ | 0 | True | True | True | True | +----+---------+---------+---------+--------+ | 1 | True | True | False | True | +----+---------+---------+---------+--------+ | 2 | True | True | True | True | +----+---------+---------+---------+--------+ | 3 | False | False | False | True | +----+---------+---------+---------+--------+ >>> df.all(axis=0) A False B False C False D True dtype: bool >>> df.all(axis=1) 0 True 1 False 2 True 3 False dtype: bool >>> df.all(axis=None) False .. py:method:: any(axis=0) -> Union[Series, bool] Return whether any element is True, potentially over an axis. Returns False unless there is at least one element along a Dataframe axis that is True. Currently, will ignore any columns that are not type bool. This is equivalent to the pandas option bool_only=True. :param axis: Indicate which axis or axes should be reduced. 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels. 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index. None : reduce all axes, return a scalar. :type axis: {0 or ‘index’, 1 or ‘columns’, None}, default = 0 :rtype: arkouda.series.Series or bool :raises ValueError: Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}. .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({"A":[True,True,True,False],"B":[True,True,True,False], ... "C":[True,False,True,False],"D":[False,False,False,False]}) +----+---------+---------+---------+---------+ | | A | B | C | D | +====+=========+=========+=========+=========+ | 0 | True | True | True | False | +----+---------+---------+---------+---------+ | 1 | True | True | False | False | +----+---------+---------+---------+---------+ | 2 | True | True | True | False | +----+---------+---------+---------+---------+ | 3 | False | False | False | False | +----+---------+---------+---------+---------+ >>> df.any(axis=0) A True B True C True D False dtype: bool >>> df.any(axis=1) 0 True 1 True 2 True 3 False dtype: bool >>> df.any(axis=None) True .. py:method:: append(other, ordered=True) Concatenate data from 'other' onto the end of this DataFrame, in place. Explicitly, use the arkouda concatenate function to append the data from each column in other to the end of self. This operation is done in place, in the sense that the underlying pdarrays are updated from the result of the arkouda concatenate function, rather than returning a new DataFrame object containing the result. :param other: The DataFrame object whose data will be appended to this DataFrame. :type other: DataFrame :param ordered: If False, allow rows to be interleaved for better performance (but data within a row remains together). By default, append all rows to the end, in input order. 
:type ordered: bool, default=True :returns: Appending occurs in-place, but result is returned for compatibility. :rtype: self .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df1 = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 3 | +----+--------+--------+ | 1 | 2 | 4 | +----+--------+--------+ >>> df2 = ak.DataFrame({'col1': [3], 'col2': [5]}) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 3 | 5 | +----+--------+--------+ >>> df1.append(df2) >>> df1 +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 3 | +----+--------+--------+ | 1 | 2 | 4 | +----+--------+--------+ | 2 | 3 | 5 | +----+--------+--------+ .. py:method:: apply_permutation(perm) Apply a permutation to an entire DataFrame. The operation is done in place and the original DataFrame will be modified. This may be useful if you want to unsort a DataFrame, or even to apply an arbitrary permutation such as the inverse of a sorting permutation. :param perm: A permutation array. Should be the same size as the data arrays, and should consist of the integers [0,size-1] in some order. Very minimal testing is done to ensure this is a permutation. :type perm: pdarray :rtype: None .. seealso:: :obj:`sort` .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 4 | +----+--------+--------+ | 1 | 2 | 5 | +----+--------+--------+ | 2 | 3 | 6 | +----+--------+--------+ >>> perm_arry = ak.array([0, 2, 1]) >>> df.apply_permutation(perm_arry) >>> display(df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 4 | +----+--------+--------+ | 1 | 3 | 6 | +----+--------+--------+ | 2 | 2 | 5 | +----+--------+--------+ .. py:method:: argsort(key, ascending=True) Return the permutation that sorts the dataframe by `key`. :param key: The key to sort on. :type key: str :param ascending: If true, sort the key in ascending order. Otherwise, sort the key in descending order. :type ascending: bool, default = True :returns: The permutation array that sorts the data on `key`. :rtype: arkouda.pdarrayclass.pdarray .. seealso:: :obj:`coargsort` .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]}) >>> display(df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1.1 | 6 | +----+--------+--------+ | 1 | 3.1 | 5 | +----+--------+--------+ | 2 | 2.1 | 4 | +----+--------+--------+ >>> df.argsort('col1') array([0 2 1]) >>> sorted_df1 = df[df.argsort('col1')] >>> display(sorted_df1) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1.1 | 6 | +----+--------+--------+ | 1 | 2.1 | 4 | +----+--------+--------+ | 2 | 3.1 | 5 | +----+--------+--------+ >>> df.argsort('col2') array([2 1 0]) >>> sorted_df2 = df[df.argsort('col2')] >>> display(sorted_df2) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 2.1 | 4 | +----+--------+--------+ | 1 | 3.1 | 5 | +----+--------+--------+ | 2 | 1.1 | 6 | +----+--------+--------+
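The ascending flag reverses the permutation; a short sketch against the same frame, with the expected output worked out by hand from the col1 values above: >>> df.argsort('col1', ascending=False) array([1 2 0])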
.. py:method:: assign(**kwargs) -> DataFrame Assign new columns to a DataFrame. Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten. :param \*\*kwargs: The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change the input DataFrame (though pandas doesn't check it). If the values are not callable (e.g. a Series, scalar, or array), they are simply assigned. :type \*\*kwargs: dict of {str: callable or Series} :returns: A new DataFrame with the new columns in addition to all the existing columns. :rtype: DataFrame .. rubric:: Notes Assigning multiple columns within the same ``assign`` is possible. Later items in '\*\*kwargs' may refer to newly created or modified columns in 'df'; items are computed and assigned into 'df' in order. .. rubric:: Examples >>> df = ak.DataFrame({'temp_c': [17.0, 25.0]}, ... index=['Portland', 'Berkeley']) >>> df temp_c Portland 17.0 Berkeley 25.0 Where the value is a callable, evaluated on `df`: >>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32) temp_c temp_f Portland 17.0 62.6 Berkeley 25.0 77.0 Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence: >>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32) temp_c temp_f Portland 17.0 62.6 Berkeley 25.0 77.0 You can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign: >>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32, ... temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9) temp_c temp_f temp_k Portland 17.0 62.6 290.15 Berkeley 25.0 77.0 298.15 .. py:method:: attach(user_defined_name: str) -> DataFrame Return a DataFrame object attached to the name in the arkouda server which was registered using register(). :param user_defined_name: user defined name which DataFrame object was registered under. :type user_defined_name: str :returns: The DataFrame object created by re-attaching to the corresponding server components. :rtype: arkouda.dataframe.DataFrame :raises RegistrationError: Raised if user_defined_name is not registered. .. seealso:: :obj:`register`, :obj:`is_registered`, :obj:`unregister` .. rubric:: Example >>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}) >>> df.register("my_table_name") >>> df.attach("my_table_name") >>> df.is_registered() True >>> df.unregister() >>> df.is_registered() False .. py:method:: coargsort(keys, ascending=True) Return the permutation that sorts the dataframe by `keys`. Note: Sorting using Strings may not yield correct sort order. :param keys: The keys to sort on. :type keys: list of str :returns: The permutation array that sorts the data on `keys`. :rtype: arkouda.pdarrayclass.pdarray .. rubric:: Example >>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]}) >>> display(df) +----+--------+--------+--------+ | | col1 | col2 | col3 | +====+========+========+========+ | 0 | 2 | 3 | 5 | +----+--------+--------+--------+ | 1 | 2 | 4 | 6 | +----+--------+--------+--------+ | 2 | 1 | 3 | 7 | +----+--------+--------+--------+ >>> df.coargsort(['col1', 'col2']) array([2 0 1])
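coargsort also accepts the ascending flag; a brief sketch with the same frame, with the permutation derived by hand from the column values: >>> df.coargsort(['col1', 'col2'], ascending=False) array([1 0 2])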
.. py:property:: columns An Index where the values are the column names of the dataframe. :returns: The values of the index are the column names of the dataframe. :rtype: arkouda.index.Index .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 3 | +----+--------+--------+ | 1 | 2 | 4 | +----+--------+--------+ >>> df.columns Index(array(['col1', 'col2']), dtype='<U0') .. py:method:: corr() -> DataFrame Return new DataFrame with pairwise correlation of columns. :returns: Arkouda DataFrame containing correlation matrix of all columns. :rtype: arkouda.dataframe.DataFrame :raises RuntimeError: Raised if there's a server-side error thrown. .. seealso:: :obj:`pdarray.corr` .. rubric:: Notes Generates the correlation matrix using Pearson R for all columns. Attempts to convert to numeric values where possible for inclusion in the matrix. .. rubric:: Example >>> df = ak.DataFrame({'col1': [1, 2], 'col2': [-1, -2]}) >>> display(df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | -1 | +----+--------+--------+ | 1 | 2 | -2 | +----+--------+--------+ >>> df.corr() +------+--------+--------+ | | col1 | col2 | +======+========+========+ | col1 | 1 | -1 | +------+--------+--------+ | col2 | -1 | 1 | +------+--------+--------+ .. py:method:: count(axis: Union[int, str] = 0, numeric_only=False) -> Series Count non-NA cells for each column or row. The values np.NaN are considered NA. :param axis: If 0 or 'index', counts are generated for each column. If 1 or 'columns', counts are generated for each row. :type axis: {0 or 'index', 1 or 'columns'}, default 0 :param numeric_only: Include only float, int or boolean data. :type numeric_only: bool = False :returns: For each column/row the number of non-NA/null entries. :rtype: arkouda.series.Series :raises ValueError: Raised if axis is not 0, 1, 'index', or 'columns'. .. seealso:: :obj:`GroupBy.count` .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import numpy as np >>> df = ak.DataFrame({'col_A': ak.array([7, np.nan]), 'col_B':ak.array([1, 9])}) >>> display(df) +----+---------+---------+ | | col_A | col_B | +====+=========+=========+ | 0 | 7 | 1 | +----+---------+---------+ | 1 | nan | 9 | +----+---------+---------+ >>> df.count() col_A 1 col_B 2 dtype: int64 >>> df = ak.DataFrame({'col_A': ak.array(["a","b","c"]), 'col_B':ak.array([1, np.nan, np.nan])}) >>> display(df) +----+---------+---------+ | | col_A | col_B | +====+=========+=========+ | 0 | a | 1 | +----+---------+---------+ | 1 | b | nan | +----+---------+---------+ | 2 | c | nan | +----+---------+---------+ >>> df.count() col_A 3 col_B 1 dtype: int64 >>> df.count(numeric_only=True) col_B 1 dtype: int64 >>> df.count(axis=1) 0 2 1 1 2 1 dtype: int64
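The string form of the axis argument behaves identically to the integer form; a one-line sketch on the same frame: >>> df.count(axis='columns') 0 2 1 1 2 1 dtype: int64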
.. py:method:: drop(keys: Union[str, int, List[Union[str, int]]], axis: Union[str, int] = 0, inplace: bool = False) -> Union[None, DataFrame] Drop columns or rows from the dataframe. :param keys: The labels to be dropped on the given axis. :type keys: str, int or list :param axis: The axis on which to drop from. 0/'index' - drop rows, 1/'columns' - drop columns. :type axis: int or str :param inplace: When True, perform the operation on the calling object. When False, return a new object. :type inplace: bool, default=False :returns: DataFrame when `inplace=False`; None when `inplace=True` :rtype: arkouda.dataframe.DataFrame or None .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> display(df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 3 | +----+--------+--------+ | 1 | 2 | 4 | +----+--------+--------+ Drop column >>> df.drop('col1', axis = 1) +----+--------+ | | col2 | +====+========+ | 0 | 3 | +----+--------+ | 1 | 4 | +----+--------+ Drop row >>> df.drop(0, axis = 0) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 2 | 4 | +----+--------+--------+ .. py:method:: drop_duplicates(subset=None, keep='first') Drop duplicated rows and return the resulting DataFrame. If a subset of the columns is provided, then only one instance of each duplicated row will be returned (keep determines which row). :param subset: Iterable of column names to use to dedupe. :type subset: Iterable :param keep: Determines which duplicates (if any) to keep. :type keep: {'first', 'last'}, default='first' :returns: DataFrame with duplicates removed. :rtype: arkouda.dataframe.DataFrame .. rubric:: Example >>> df = ak.DataFrame({'col1': [1, 2, 2, 3], 'col2': [4, 5, 5, 6]}) >>> display(df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 4 | +----+--------+--------+ | 1 | 2 | 5 | +----+--------+--------+ | 2 | 2 | 5 | +----+--------+--------+ | 3 | 3 | 6 | +----+--------+--------+ >>> df.drop_duplicates() +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 4 | +----+--------+--------+ | 1 | 2 | 5 | +----+--------+--------+ | 2 | 3 | 6 | +----+--------+--------+
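subset and keep combine to control which duplicate survives; a minimal sketch on a small hypothetical frame (the chained column lookup on the result is for illustration only): >>> dd = ak.DataFrame({'col1': [1, 1], 'col2': [7, 8]}) >>> dd.drop_duplicates(subset=['col1'], keep='last')['col2'] array([8])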
.. py:method:: dropna(axis: Union[int, str] = 0, how: Optional[str] = None, thresh: Optional[int] = None, ignore_index: bool = False) -> DataFrame Remove missing values. :param axis: Determine if rows or columns which contain missing values are removed. 0, or 'index': Drop rows which contain missing values. 1, or 'columns': Drop columns which contain missing values. Only a single axis is allowed. :type axis: {0 or 'index', 1 or 'columns'}, default = 0 :param how: Determine if row or column is removed from DataFrame, when we have at least one NA or all NA. 'any': If any NA values are present, drop that row or column. 'all': If all values are NA, drop that row or column. :type how: {'any', 'all'}, default='any' :param thresh: Require that many non-NA values. Cannot be combined with how. :type thresh: int, optional :param ignore_index: If ``True``, the resulting axis will be labeled 0, 1, …, n - 1. :type ignore_index: bool, default ``False`` :returns: DataFrame with NA entries dropped from it. :rtype: arkouda.dataframe.DataFrame .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import numpy as np >>> df = ak.DataFrame( { "A": [True, True, True, True], "B": [1, np.nan, 2, np.nan], "C": [1, 2, 3, np.nan], "D": [False, False, False, False], "E": [1, 2, 3, 4], "F": ["a", "b", "c", "d"], "G": [1, 2, 3, 4], } ) >>> display(df) +----+------+-----+-----+-------+-----+-----+-----+ | | A | B | C | D | E | F | G | +====+======+=====+=====+=======+=====+=====+=====+ | 0 | True | 1 | 1 | False | 1 | a | 1 | +----+------+-----+-----+-------+-----+-----+-----+ | 1 | True | nan | 2 | False | 2 | b | 2 | +----+------+-----+-----+-------+-----+-----+-----+ | 2 | True | 2 | 3 | False | 3 | c | 3 | +----+------+-----+-----+-------+-----+-----+-----+ | 3 | True | nan | nan | False | 4 | d | 4 | +----+------+-----+-----+-------+-----+-----+-----+ >>> df.dropna() +----+------+-----+-----+-------+-----+-----+-----+ | | A | B | C | D | E | F | G | +====+======+=====+=====+=======+=====+=====+=====+ | 0 | True | 1 | 1 | False | 1 | a | 1 | +----+------+-----+-----+-------+-----+-----+-----+ | 1 | True | 2 | 3 | False | 3 | c | 3 | +----+------+-----+-----+-------+-----+-----+-----+ >>> df.dropna(axis=1) +----+------+-------+-----+-----+-----+ | | A | D | E | F | G | +====+======+=======+=====+=====+=====+ | 0 | True | False | 1 | a | 1 | +----+------+-------+-----+-----+-----+ | 1 | True | False | 2 | b | 2 | +----+------+-------+-----+-----+-----+ | 2 | True | False | 3 | c | 3 | +----+------+-------+-----+-----+-----+ | 3 | True | False | 4 | d | 4 | +----+------+-------+-----+-----+-----+ >>> df.dropna(axis=1, thresh=3) +----+------+-----+-------+-----+-----+-----+ | | A | C | D | E | F | G | +====+======+=====+=======+=====+=====+=====+ | 0 | True | 1 | False | 1 | a | 1 | +----+------+-----+-------+-----+-----+-----+ | 1 | True | 2 | False | 2 | b | 2 | +----+------+-----+-------+-----+-----+-----+ | 2 | True | 3 | False | 3 | c | 3 | +----+------+-----+-------+-----+-----+-----+ | 3 | True | nan | False | 4 | d | 4 | +----+------+-----+-------+-----+-----+-----+ >>> df.dropna(axis=1, how="all") +----+------+-----+-----+-------+-----+-----+-----+ | | A | B | C | D | E | F | G | +====+======+=====+=====+=======+=====+=====+=====+ | 0 | True | 1 | 1 | False | 1 | a | 1 | +----+------+-----+-----+-------+-----+-----+-----+ | 1 | True | nan | 2 | False | 2 | b | 2 | +----+------+-----+-----+-------+-----+-----+-----+ | 2 | True | 2 | 3 | False | 3 | c | 3 | +----+------+-----+-----+-------+-----+-----+-----+ | 3 | True | nan | nan | False | 4 | d | 4 | +----+------+-----+-----+-------+-----+-----+-----+ .. py:property:: dtypes :type: arkouda.row.Row The dtypes of the dataframe. :returns: **dtypes** -- The dtypes of the dataframe. :rtype: arkouda.row.Row .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]}) >>> df +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | a | +----+--------+--------+ | 1 | 2 | b | +----+--------+--------+ >>> df.dtypes +------+--------+ | keys | values | +======+========+ | col1 | int64 | +------+--------+ | col2 | str | +------+--------+ .. py:property:: empty :type: bool Whether the dataframe is empty. :returns: True if the dataframe is empty, otherwise False. :rtype: bool .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({}) >>> df 0 rows x 0 columns >>> df.empty True ..
py:method:: filter_by_range(keys, low=1, high=None) Find all rows where the value count of the items in a given set of columns (keys) is within the range [low, high]. To filter by a specific value, set low == high. :param keys: The names of the columns to group by. :type keys: str or list of str :param low: The lowest value count. :type low: int, default=1 :param high: The highest value count; defaults to unlimited. :type high: int, default=None :returns: An array of boolean values for qualified rows in this DataFrame. :rtype: arkouda.pdarrayclass.pdarray .. rubric:: Example >>> df = ak.DataFrame({'col1': [1, 2, 2, 2, 3, 3], 'col2': [4, 5, 6, 7, 8, 9]}) >>> display(df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 4 | +----+--------+--------+ | 1 | 2 | 5 | +----+--------+--------+ | 2 | 2 | 6 | +----+--------+--------+ | 3 | 2 | 7 | +----+--------+--------+ | 4 | 3 | 8 | +----+--------+--------+ | 5 | 3 | 9 | +----+--------+--------+ >>> df.filter_by_range("col1", low=1, high=2) array([True False False False True True]) >>> filtered_df = df[df.filter_by_range("col1", low=1, high=2)] >>> display(filtered_df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 4 | +----+--------+--------+ | 1 | 3 | 8 | +----+--------+--------+ | 2 | 3 | 9 | +----+--------+--------+ .. py:method:: from_pandas(pd_df) Copy the data from a pandas DataFrame into a new arkouda.dataframe.DataFrame. :param pd_df: A pandas DataFrame to convert. :type pd_df: pandas.DataFrame :rtype: arkouda.dataframe.DataFrame .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import pandas as pd >>> pd_df = pd.DataFrame({"A":[1,2],"B":[3,4]}) >>> type(pd_df) <class 'pandas.core.frame.DataFrame'> >>> display(pd_df) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 3 | +----+-----+-----+ | 1 | 2 | 4 | +----+-----+-----+ >>> ak_df = ak.DataFrame.from_pandas(pd_df) >>> type(ak_df) <class 'arkouda.dataframe.DataFrame'> >>> display(ak_df) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 3 | +----+-----+-----+ | 1 | 2 | 4 | +----+-----+-----+ .. py:method:: from_return_msg(rep_msg) Creates a DataFrame object from an arkouda server response message. :param rep_msg: Server response message used to create a DataFrame. :type rep_msg: str :rtype: arkouda.dataframe.DataFrame .. py:method:: groupby(keys, use_series=True, as_index=True, dropna=True) Group the dataframe by a column or a list of columns. Alias for GroupBy. :param keys: An (ordered) list of column names or a single string to group by. :type keys: str or list of str :param use_series: If True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise an arkouda.groupbyclass.GroupBy object. :type use_series: bool, default=True :param as_index: If True, groupby columns will be set as index; otherwise, the groupby columns will be treated as DataFrame columns. :type as_index: bool, default=True :param dropna: If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept. :type dropna: bool, default=True :returns: If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise returns an arkouda.groupbyclass.GroupBy object. :rtype: arkouda.dataframe.DataFrameGroupBy or arkouda.groupbyclass.GroupBy .. seealso:: :obj:`arkouda.GroupBy` ..
rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import numpy as np >>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]}) >>> df +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 4 | +----+--------+--------+ | 1 | 1 | 5 | +----+--------+--------+ | 2 | 2 | 6 | +----+--------+--------+ | 3 | nan | 7 | +----+--------+--------+ >>> df.GroupBy("col1") >>> df.GroupBy("col1").size() (array([1.00000000000000000 2.00000000000000000]), array([2 1])) >>> df.GroupBy("col1", use_series=True).size() col1 1.0 2 2.0 1 dtype: int64 >>> df.GroupBy("col1", use_series=True, as_index = False).size() +----+--------+--------+ | | col1 | size | +====+========+========+ | 0 | 1 | 2 | +----+--------+--------+ | 1 | 2 | 1 | +----+--------+--------+ .. py:method:: head(n=5) Return the first `n` rows. This function returns the first `n` rows of the dataframe. It is useful for quickly verifying data, for example, after sorting or appending rows. :param n: Number of rows to select. :type n: int, default = 5 :returns: The first `n` rows of the DataFrame. :rtype: arkouda.dataframe.DataFrame .. seealso:: :obj:`tail` .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)}) >>> display(df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 0 | 0 | +----+--------+--------+ | 1 | 1 | -1 | +----+--------+--------+ | 2 | 2 | -2 | +----+--------+--------+ | 3 | 3 | -3 | +----+--------+--------+ | 4 | 4 | -4 | +----+--------+--------+ | 5 | 5 | -5 | +----+--------+--------+ | 6 | 6 | -6 | +----+--------+--------+ | 7 | 7 | -7 | +----+--------+--------+ | 8 | 8 | -8 | +----+--------+--------+ | 9 | 9 | -9 | +----+--------+--------+ >>> df.head() +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 0 | 0 | +----+--------+--------+ | 1 | 1 | -1 | +----+--------+--------+ | 2 | 2 | -2 | +----+--------+--------+ | 3 | 3 | -3 | +----+--------+--------+ | 4 | 4 | -4 | +----+--------+--------+ >>> df.head(n=2) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 0 | 0 | +----+--------+--------+ | 1 | 1 | -1 | +----+--------+--------+ .. py:property:: index The index of the dataframe. :returns: The index of the dataframe. :rtype: arkouda.index.Index or arkouda.index.MultiIndex .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 3 | +----+--------+--------+ | 1 | 2 | 4 | +----+--------+--------+ >>> df.index Index(array([0 1]), dtype='int64') .. py:property:: info Returns a summary string of this dataframe. :returns: A summary string of this dataframe. :rtype: str .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]}) >>> df +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | a | +----+--------+--------+ | 1 | 2 | b | +----+--------+--------+ >>> df.info "DataFrame(['col1', 'col2'], 2 rows, 20 B)" .. py:method:: is_registered() -> bool Return True if the object is contained in the registry. :returns: Indicates if the object is contained in the registry. :rtype: bool :raises RegistrationError: Raised if there's a server-side error or a mismatch of registered components. .. seealso:: :obj:`register`, :obj:`attach`, :obj:`unregister`, :obj:`unregister_dataframe_by_name` ..
rubric:: Notes Objects registered with the server are immune to deletion until they are unregistered. .. rubric:: Example >>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}) >>> df.register("my_table_name") >>> df.attach("my_table_name") >>> df.is_registered() True >>> df.unregister() >>> df.is_registered() False .. py:method:: isin(values: Union[pdarray, Dict, Series, DataFrame]) -> DataFrame Determine whether each element in the DataFrame is contained in values. :param values: The values to check for in DataFrame. Series can only have a single index. :type values: pdarray, dict, Series, or DataFrame :returns: Arkouda DataFrame of booleans showing whether each element in the DataFrame is contained in values. :rtype: arkouda.dataframe.DataFrame .. seealso:: :obj:`ak.Series.isin` .. rubric:: Notes - Pandas supports values being an iterable type. In arkouda, we replace this with pdarray. - Pandas supports ~ operations. Currently, ak.DataFrame does not support this. .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col_A': ak.array([7, 3]), 'col_B':ak.array([1, 9])}) >>> display(df) +----+---------+---------+ | | col_A | col_B | +====+=========+=========+ | 0 | 7 | 1 | +----+---------+---------+ | 1 | 3 | 9 | +----+---------+---------+ When `values` is a pdarray, check every value in the DataFrame to determine if it exists in values. >>> df.isin(ak.array([0, 1])) +----+---------+---------+ | | col_A | col_B | +====+=========+=========+ | 0 | 0 | 1 | +----+---------+---------+ | 1 | 0 | 0 | +----+---------+---------+ When `values` is a dict, the values in the dict are passed to check the column indicated by the key. >>> df.isin({'col_A': ak.array([0, 3])}) +----+---------+---------+ | | col_A | col_B | +====+=========+=========+ | 0 | 0 | 0 | +----+---------+---------+ | 1 | 1 | 0 | +----+---------+---------+ When `values` is a Series, each column is checked if values is present positionally. This means that for `True` to be returned, the indexes must be the same. >>> i = ak.Index(ak.arange(2)) >>> s = ak.Series(data=[3, 9], index=i) >>> df.isin(s) +----+---------+---------+ | | col_A | col_B | +====+=========+=========+ | 0 | 0 | 0 | +----+---------+---------+ | 1 | 0 | 1 | +----+---------+---------+ When `values` is a DataFrame, the index and column must match. Note that 9 is not found because the column name does not match. >>> other_df = ak.DataFrame({'col_A':ak.array([7, 3]), 'col_C':ak.array([0, 9])}) >>> df.isin(other_df) +----+---------+---------+ | | col_A | col_B | +====+=========+=========+ | 0 | 1 | 0 | +----+---------+---------+ | 1 | 1 | 0 | +----+---------+---------+ .. py:method:: isna() -> DataFrame Detect missing values. Return a boolean same-sized object indicating if the values are NA. numpy.NaN values get mapped to True values. Everything else gets mapped to False values. :returns: Mask of bool values for each element in DataFrame that indicates whether an element is an NA value. :rtype: arkouda.dataframe.DataFrame .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import numpy as np >>> df = ak.DataFrame({"A": [np.nan, 2, 2, 3], "B": [3, np.nan, 5, 6], ... 
"C": [1, np.nan, 2, np.nan], "D":["a","b","c","d"]}) >>> display(df) +----+-----+-----+-----+-----+ | | A | B | C | D | +====+=====+=====+=====+=====+ | 0 | nan | 3 | 1 | a | +----+-----+-----+-----+-----+ | 1 | 2 | nan | nan | b | +----+-----+-----+-----+-----+ | 2 | 2 | 5 | 2 | c | +----+-----+-----+-----+-----+ | 3 | 3 | 6 | nan | d | +----+-----+-----+-----+-----+ >>> df.isna() A B C D 0 True False False False 1 False True True False 2 False False False False 3 False False True False (4 rows x 4 columns) .. py:method:: load(prefix_path, file_format='INFER') Load dataframe from file. file_format needed for consistency with other load functions. :param prefix_path: The prefix path for the data. :type prefix_path: str :param file_format: :type file_format: string, default = "INFER" :returns: A dataframe loaded from the prefix_path. :rtype: arkouda.dataframe.DataFrame .. rubric:: Examples To store data in /my_data_LOCALE0000, use "/my_data" as the prefix. >>> import arkouda as ak >>> ak.connect() >>> import os.path >>> from pathlib import Path >>> my_path = os.path.join(os.getcwd(), 'hdf5_output','my_data') >>> Path(my_path).mkdir(parents=True, exist_ok=True) >>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)}) >>> df.save(my_path, file_type="distribute") >>> df.load(my_path) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 0 | 0 | +----+-----+-----+ | 1 | 1 | -1 | +----+-----+-----+ | 2 | 2 | -2 | +----+-----+-----+ | 3 | 3 | -3 | +----+-----+-----+ | 4 | 4 | -4 | +----+-----+-----+ .. py:method:: memory_usage(index=True, unit='B') -> Series Return the memory usage of each column in bytes. The memory usage can optionally include the contribution of the index. :param index: Specifies whether to include the memory usage of the DataFrame's index in returned Series. If ``index=True``, the memory usage of the index is the first item in the output. :type index: bool, default True :param unit: Unit to return. One of {'B', 'KB', 'MB', 'GB'}. :type unit: str, default = "B" :returns: A Series whose index is the original column names and whose values is the memory usage of each column in bytes. :rtype: Series .. seealso:: :obj:`arkouda.pdarrayclass.nbytes`, :obj:`arkouda.index.Index.memory_usage`, :obj:`arkouda.index.MultiIndex.memory_usage`, :obj:`arkouda.series.Series.memory_usage` .. 
.. py:method:: memory_usage(index=True, unit='B') -> Series Return the memory usage of each column in bytes. The memory usage can optionally include the contribution of the index. :param index: Specifies whether to include the memory usage of the DataFrame's index in returned Series. If ``index=True``, the memory usage of the index is the first item in the output. :type index: bool, default True :param unit: Unit to return. One of {'B', 'KB', 'MB', 'GB'}. :type unit: str, default = "B" :returns: A Series whose index is the original column names and whose values are the memory usage of each column in bytes. :rtype: Series .. seealso:: :obj:`arkouda.pdarrayclass.nbytes`, :obj:`arkouda.index.Index.memory_usage`, :obj:`arkouda.index.MultiIndex.memory_usage`, :obj:`arkouda.series.Series.memory_usage` .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> dtypes = [ak.int64, ak.float64, ak.bool] >>> data = dict([(str(t), ak.ones(5000, dtype=ak.int64).astype(t)) for t in dtypes]) >>> df = ak.DataFrame(data) >>> display(df.head()) +----+---------+-----------+--------+ | | int64 | float64 | bool | +====+=========+===========+========+ | 0 | 1 | 1 | True | +----+---------+-----------+--------+ | 1 | 1 | 1 | True | +----+---------+-----------+--------+ | 2 | 1 | 1 | True | +----+---------+-----------+--------+ | 3 | 1 | 1 | True | +----+---------+-----------+--------+ | 4 | 1 | 1 | True | +----+---------+-----------+--------+ >>> df.memory_usage() +---------+-------+ | | 0 | +=========+=======+ | Index | 40000 | +---------+-------+ | int64 | 40000 | +---------+-------+ | float64 | 40000 | +---------+-------+ | bool | 5000 | +---------+-------+ >>> df.memory_usage(index=False) +---------+-------+ | | 0 | +=========+=======+ | int64 | 40000 | +---------+-------+ | float64 | 40000 | +---------+-------+ | bool | 5000 | +---------+-------+ >>> df.memory_usage(unit="KB") +---------+----------+ | | 0 | +=========+==========+ | Index | 39.0625 | +---------+----------+ | int64 | 39.0625 | +---------+----------+ | float64 | 39.0625 | +---------+----------+ | bool | 4.88281 | +---------+----------+ To get the approximate total memory usage: >>> df.memory_usage(index=True).sum() 125000 .. py:method:: memory_usage_info(unit='GB') A formatted string representation of the size of this DataFrame. :param unit: Unit to return. One of {'KB', 'MB', 'GB'}. :type unit: str, default = "GB" :returns: A string representation of the number of bytes used by this DataFrame in [unit]s. :rtype: str .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': ak.arange(1000), 'col2': ak.arange(1000)}) >>> df.memory_usage_info() '0.00 GB' >>> df.memory_usage_info(unit="KB") '15 KB' .. py:method:: merge(right: DataFrame, on: Optional[Union[str, List[str]]] = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) -> DataFrame Merge Arkouda DataFrames with a database-style join. The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the "how" and "on" parameters). Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html :param right: The Right DataFrame to be joined. :type right: DataFrame :param on: The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames. :type on: Optional[Union[str, List[str]]] = None :param how: The merge condition. Must be "inner", "left", "right", or "outer". :type how: {"inner", "left", "right", "outer"}, default = "inner" :param left_suffix: A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to "_x". Only used when how is "inner". :type left_suffix: str, default = "_x" :param right_suffix: A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to "_y". Only used when how is "inner". :type right_suffix: str, default = "_y" :param convert_ints: If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = "inner".
:type convert_ints: bool = True :param sort: If True, DataFrame is returned sorted by "on". Otherwise, the DataFrame is not sorted. :type sort: bool = True :returns: Joined Arkouda DataFrame. :rtype: arkouda.dataframe.DataFrame .. note:: Multiple column joins are only supported for integer columns. .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)}) >>> display(left_df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 0 | 0 | +----+--------+--------+ | 1 | 1 | -1 | +----+--------+--------+ | 2 | 2 | -2 | +----+--------+--------+ | 3 | 3 | -3 | +----+--------+--------+ | 4 | 4 | -4 | +----+--------+--------+ >>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)}) >>> display(right_df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 0 | 0 | +----+--------+--------+ | 1 | 2 | 2 | +----+--------+--------+ | 2 | 4 | 4 | +----+--------+--------+ | 3 | 6 | 6 | +----+--------+--------+ | 4 | 8 | 8 | +----+--------+--------+ >>> left_df.merge(right_df, on = "col1") +----+--------+----------+----------+ | | col1 | col2_x | col2_y | +====+========+==========+==========+ | 0 | 0 | 0 | 0 | +----+--------+----------+----------+ | 1 | 2 | -2 | 2 | +----+--------+----------+----------+ | 2 | 4 | -4 | 4 | +----+--------+----------+----------+ >>> left_df.merge(right_df, on = "col1", how = "left") +----+--------+----------+----------+ | | col1 | col2_y | col2_x | +====+========+==========+==========+ | 0 | 0 | 0 | 0 | +----+--------+----------+----------+ | 1 | 1 | nan | -1 | +----+--------+----------+----------+ | 2 | 2 | 2 | -2 | +----+--------+----------+----------+ | 3 | 3 | nan | -3 | +----+--------+----------+----------+ | 4 | 4 | 4 | -4 | +----+--------+----------+----------+ >>> left_df.merge(right_df, on = "col1", how = "right") +----+--------+----------+----------+ | | col1 | col2_x | col2_y | +====+========+==========+==========+ | 0 | 0 | 0 | 0 | +----+--------+----------+----------+ | 1 | 2 | -2 | 2 | +----+--------+----------+----------+ | 2 | 4 | -4 | 4 | +----+--------+----------+----------+ | 3 | 6 | nan | 6 | +----+--------+----------+----------+ | 4 | 8 | nan | 8 | +----+--------+----------+----------+ >>> left_df.merge(right_df, on = "col1", how = "outer") +----+--------+----------+----------+ | | col1 | col2_y | col2_x | +====+========+==========+==========+ | 0 | 0 | 0 | 0 | +----+--------+----------+----------+ | 1 | 1 | nan | -1 | +----+--------+----------+----------+ | 2 | 2 | 2 | -2 | +----+--------+----------+----------+ | 3 | 3 | nan | -3 | +----+--------+----------+----------+ | 4 | 4 | 4 | -4 | +----+--------+----------+----------+ | 5 | 6 | 6 | nan | +----+--------+----------+----------+ | 6 | 8 | 8 | nan | +----+--------+----------+----------+ .. py:method:: notna() -> DataFrame Detect existing (non-missing) values. Return a boolean same-sized object indicating if the values are not NA. numpy.NaN values get mapped to False values. :returns: Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value. :rtype: arkouda.dataframe.DataFrame .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import numpy as np >>> df = ak.DataFrame({"A": [np.nan, 2, 2, 3], "B": [3, np.nan, 5, 6], ... 
"C": [1, np.nan, 2, np.nan], "D":["a","b","c","d"]}) >>> display(df) +----+-----+-----+-----+-----+ | | A | B | C | D | +====+=====+=====+=====+=====+ | 0 | nan | 3 | 1 | a | +----+-----+-----+-----+-----+ | 1 | 2 | nan | nan | b | +----+-----+-----+-----+-----+ | 2 | 2 | 5 | 2 | c | +----+-----+-----+-----+-----+ | 3 | 3 | 6 | nan | d | +----+-----+-----+-----+-----+ >>> df.notna() A B C D 0 False True True True 1 True False False True 2 True True True True 3 True True False True (4 rows x 4 columns) .. py:method:: objType(*args, **kwargs) str(object='') -> str str(bytes_or_buffer[, encoding[, errors]]) -> str Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to 'strict'. .. py:method:: read_csv(filename: str, col_delim: str = ',') Read the columns of a CSV file into an Arkouda DataFrame. If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as a Strings objects. :param filename: Filename to read data from. :type filename: str :param col_delim: The delimiter for columns within the data. :type col_delim: str, default="," :returns: Arkouda DataFrame containing the columns from the CSV file. :rtype: arkouda.dataframe.DataFrame :raises ValueError: Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist. :raises RuntimeError: Raised if one or more of the specified files cannot be opened. If `allow_errors` is true this may be raised if no values are returned from the server. :raises TypeError: Raised if we receive an unknown arkouda_type returned from the server. .. seealso:: :obj:`to_csv` .. rubric:: Notes - CSV format is not currently supported by load/load_all operations. - The column delimiter is expected to be the same for column names and data. - Be sure that column delimiters are not found within your data. - All CSV files must delimit rows using newline ("\\n") at this time. - Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8). .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import os.path >>> from pathlib import Path >>> my_path = os.path.join(os.getcwd(), 'csv_output','my_data') >>> Path(my_path).mkdir(parents=True, exist_ok=True) >>> df = ak.DataFrame({"A":[1,2],"B":[3,4]}) >>> df.to_csv(my_path) >>> df2 = DataFrame.read_csv(my_path + "_LOCALE0000") >>> display(df2) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 3 | +----+-----+-----+ | 1 | 2 | 4 | +----+-----+-----+ .. py:method:: register(user_defined_name: str) -> DataFrame Register this DataFrame object and underlying components with the Arkouda server. :param user_defined_name: User defined name the DataFrame is to be registered under. This will be the root name for underlying components. :type user_defined_name: str :returns: The same DataFrame which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different DataFrames with the same name. :rtype: arkouda.dataframe.DataFrame :raises TypeError: Raised if user_defined_name is not a str. 
.. py:method:: register(user_defined_name: str) -> DataFrame Register this DataFrame object and underlying components with the Arkouda server. :param user_defined_name: User defined name the DataFrame is to be registered under. This will be the root name for underlying components. :type user_defined_name: str :returns: The same DataFrame which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different DataFrames with the same name. :rtype: arkouda.dataframe.DataFrame :raises TypeError: Raised if user_defined_name is not a str. :raises RegistrationError: If the server was unable to register the DataFrame with the user_defined_name. .. seealso:: :obj:`unregister`, :obj:`attach`, :obj:`unregister_dataframe_by_name`, :obj:`is_registered` .. rubric:: Notes Objects registered with the server are immune to deletion until they are unregistered. Any changes made to a DataFrame object after registering with the server may not be reflected in attached copies. .. rubric:: Example >>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}) >>> df.register("my_table_name") >>> df.attach("my_table_name") >>> df.is_registered() True >>> df.unregister() >>> df.is_registered() False .. py:method:: rename(mapper: Optional[Union[Callable, Dict]] = None, index: Optional[Union[Callable, Dict]] = None, column: Optional[Union[Callable, Dict]] = None, axis: Union[str, int] = 0, inplace: bool = False) -> Optional[DataFrame] Rename indexes or columns according to a mapping. :param mapper: Function or dictionary mapping existing values to new values. Nonexistent names will not raise an error. Uses the value of axis to determine if renaming column or index :type mapper: callable or dict-like, Optional :param column: Function or dictionary mapping existing column names to new column names. Nonexistent names will not raise an error. When this is set, axis is ignored. :type column: callable or dict-like, Optional :param index: Function or dictionary mapping existing index names to new index names. Nonexistent names will not raise an error. When this is set, axis is ignored. :type index: callable or dict-like, Optional :param axis: Indicates which axis to perform the rename. 0/"index" - Indexes 1/"column" - Columns :type axis: int or str, default=0 :param inplace: When True, perform the operation on the calling object. When False, return a new object. :type inplace: bool, default=False :returns: DataFrame when `inplace=False`; None when `inplace=True`. :rtype: arkouda.dataframe.DataFrame or None .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])}) >>> display(df) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 4 | +----+-----+-----+ | 1 | 2 | 5 | +----+-----+-----+ | 2 | 3 | 6 | +----+-----+-----+ Rename columns using a mapping: >>> df.rename(column={'A':'a', 'B':'c'}) +----+-----+-----+ | | a | c | +====+=====+=====+ | 0 | 1 | 4 | +----+-----+-----+ | 1 | 2 | 5 | +----+-----+-----+ | 2 | 3 | 6 | +----+-----+-----+ Rename indexes using a mapping: >>> df.rename(index={0:99, 2:11}) +----+-----+-----+ | | A | B | +====+=====+=====+ | 99 | 1 | 4 | +----+-----+-----+ | 1 | 2 | 5 | +----+-----+-----+ | 11 | 3 | 6 | +----+-----+-----+ Rename using an axis style parameter: >>> df.rename(str.lower, axis='column') +----+-----+-----+ | | a | b | +====+=====+=====+ | 0 | 1 | 4 | +----+-----+-----+ | 1 | 2 | 5 | +----+-----+-----+ | 2 | 3 | 6 | +----+-----+-----+
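With inplace=True the frame is renamed in place and None is returned; a small sketch (the Index repr shown follows the columns example earlier): >>> df.rename(column={'A':'a', 'B':'c'}, inplace=True) >>> df.columns Index(array(['a', 'c']), dtype='<U0')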
.. py:method:: reset_index(size: Optional[int] = None, inplace: bool = False) -> Union[None, DataFrame] Set the index to an integer range. Useful if this dataframe is the result of a slice operation from another dataframe, or if you have permuted the rows and no longer need to keep that ordering on the rows. :param size: If size is passed, do not attempt to determine size based on existing column sizes. Assume caller handles consistency correctly. :type size: int, optional :param inplace: When True, perform the operation on the calling object. When False, return a new object. :type inplace: bool, default=False :returns: DataFrame when `inplace=False`; None when `inplace=True`. :rtype: arkouda.dataframe.DataFrame or None .. note:: Pandas adds a column 'index' to indicate the original index. Arkouda does not currently support this behavior. .. rubric:: Example >>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])}) >>> display(df) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 4 | +----+-----+-----+ | 1 | 2 | 5 | +----+-----+-----+ | 2 | 3 | 6 | +----+-----+-----+ >>> perm_df = df[ak.array([0,2,1])] >>> display(perm_df) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 4 | +----+-----+-----+ | 1 | 3 | 6 | +----+-----+-----+ | 2 | 2 | 5 | +----+-----+-----+ >>> perm_df.reset_index() +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 4 | +----+-----+-----+ | 1 | 3 | 6 | +----+-----+-----+ | 2 | 2 | 5 | +----+-----+-----+ .. py:method:: sample(n=5) Return a random sample of `n` rows. :param n: Number of rows to return. :type n: int, default=5 :returns: The sampled `n` rows of the DataFrame. :rtype: arkouda.dataframe.DataFrame .. rubric:: Example >>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)}) >>> display(df) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 0 | 0 | +----+-----+-----+ | 1 | 1 | -1 | +----+-----+-----+ | 2 | 2 | -2 | +----+-----+-----+ | 3 | 3 | -3 | +----+-----+-----+ | 4 | 4 | -4 | +----+-----+-----+ Random output of size 3: >>> df.sample(n=3) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 0 | 0 | +----+-----+-----+ | 1 | 1 | -1 | +----+-----+-----+ | 2 | 4 | -4 | +----+-----+-----+ .. py:method:: save(path, index=False, columns=None, file_format='HDF5', file_type='distribute', compression: Optional[str] = None) DEPRECATED Save DataFrame to disk, preserving column names. :param path: File path to save data. :type path: str :param index: If True, save the index column. By default, do not save the index. :type index: bool, default=False :param columns: List of columns to include in the file. If None, writes out all columns. :type columns: list, default=None :param file_format: 'HDF5' or 'Parquet'. Defaults to 'HDF5' :type file_format: str, default='HDF5' :param file_type: "single" or "distribute". If single, will write a single file to locale 0. :type file_type: str, default=distribute :param compression: (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4") Compression type. Only used for Parquet :type compression: str (Optional) .. rubric:: Notes This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number. .. seealso:: :obj:`to_parquet`, :obj:`to_hdf` .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import os.path >>> from pathlib import Path >>> my_path = os.path.join(os.getcwd(), 'hdf5_output') >>> Path(my_path).mkdir(parents=True, exist_ok=True) >>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)}) >>> df.save(my_path + '/my_data', file_type="single") >>> df.load(my_path + '/my_data') +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 0 | 0 | +----+-----+-----+ | 1 | 1 | -1 | +----+-----+-----+ | 2 | 2 | -2 | +----+-----+-----+ | 3 | 3 | -3 | +----+-----+-----+ | 4 | 4 | -4 | +----+-----+-----+
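For Parquet output, save also accepts a compression codec; a one-line sketch with the same frame (the file name is illustrative): >>> df.save(my_path + '/my_data_pq', file_format='Parquet', compression='snappy')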
.. py:property:: shape The shape of the dataframe. :returns: Tuple of array dimensions. :rtype: tuple of int .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}) >>> df +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 4 | +----+--------+--------+ | 1 | 2 | 5 | +----+--------+--------+ | 2 | 3 | 6 | +----+--------+--------+ >>> df.shape (3, 2) .. py:property:: size Returns the number of elements in the dataframe (rows times columns). :returns: The number of elements in the dataframe. :rtype: int .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}) >>> df +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 1 | 4 | +----+--------+--------+ | 1 | 2 | 5 | +----+--------+--------+ | 2 | 3 | 6 | +----+--------+--------+ >>> df.size 6 .. py:method:: sort_index(ascending=True) Sort the DataFrame by indexed columns. Note: Fails on sort order of arkouda.strings.Strings columns when multiple columns being sorted. :param ascending: Sort values in ascending (default) or descending order. :type ascending: bool, default = True .. rubric:: Example >>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]}, ... index = ak.Index(ak.array([2,0,1]), name="idx")) >>> display(df) +----+--------+--------+ | idx| col1 | col2 | +====+========+========+ | 2 | 1.1 | 6 | +----+--------+--------+ | 0 | 3.1 | 5 | +----+--------+--------+ | 1 | 2.1 | 4 | +----+--------+--------+ >>> df.sort_index() +----+--------+--------+ | idx| col1 | col2 | +====+========+========+ | 0 | 3.1 | 5 | +----+--------+--------+ | 1 | 2.1 | 4 | +----+--------+--------+ | 2 | 1.1 | 6 | +----+--------+--------+ .. py:method:: sort_values(by=None, ascending=True) Sort the DataFrame by one or more columns. If no column is specified, all columns are used. Note: Fails on order of arkouda.strings.Strings columns when multiple columns being sorted. :param by: The name(s) of the column(s) to sort by. :type by: str or list/tuple of str, default = None :param ascending: Sort values in ascending (default) or descending order. :type ascending: bool, default = True .. seealso:: :obj:`apply_permutation` .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]}) >>> display(df) +----+--------+--------+--------+ | | col1 | col2 | col3 | +====+========+========+========+ | 0 | 2 | 3 | 5 | +----+--------+--------+--------+ | 1 | 2 | 4 | 6 | +----+--------+--------+--------+ | 2 | 1 | 3 | 7 | +----+--------+--------+--------+ >>> df.sort_values() +----+--------+--------+--------+ | | col1 | col2 | col3 | +====+========+========+========+ | 0 | 1 | 3 | 7 | +----+--------+--------+--------+ | 1 | 2 | 3 | 5 | +----+--------+--------+--------+ | 2 | 2 | 4 | 6 | +----+--------+--------+--------+ >>> df.sort_values("col3") +----+--------+--------+--------+ | | col1 | col2 | col3 | +====+========+========+========+ | 0 | 1 | 3 | 7 | +----+--------+--------+--------+ | 1 | 2 | 3 | 5 | +----+--------+--------+--------+ | 2 | 2 | 4 | 6 | +----+--------+--------+--------+
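Descending sorts use the same ascending flag; a brief sketch on the same frame, with the rows ordered by hand from the col3 values: >>> df.sort_values("col3", ascending=False) +----+--------+--------+--------+ | | col1 | col2 | col3 | +====+========+========+========+ | 0 | 1 | 3 | 7 | +----+--------+--------+--------+ | 1 | 2 | 4 | 6 | +----+--------+--------+--------+ | 2 | 2 | 3 | 5 | +----+--------+--------+--------+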
.. py:method:: tail(n=5) Return the last `n` rows. This function returns the last `n` rows of the dataframe. It is useful for quickly testing if your object has the right type of data in it. :param n: Number of rows to select. :type n: int, default=5 :returns: The last `n` rows of the DataFrame. :rtype: arkouda.dataframe.DataFrame .. seealso:: :obj:`arkouda.dataframe.head` .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)}) >>> display(df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 0 | 0 | +----+--------+--------+ | 1 | 1 | -1 | +----+--------+--------+ | 2 | 2 | -2 | +----+--------+--------+ | 3 | 3 | -3 | +----+--------+--------+ | 4 | 4 | -4 | +----+--------+--------+ | 5 | 5 | -5 | +----+--------+--------+ | 6 | 6 | -6 | +----+--------+--------+ | 7 | 7 | -7 | +----+--------+--------+ | 8 | 8 | -8 | +----+--------+--------+ | 9 | 9 | -9 | +----+--------+--------+ >>> df.tail() +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 5 | -5 | +----+--------+--------+ | 1 | 6 | -6 | +----+--------+--------+ | 2 | 7 | -7 | +----+--------+--------+ | 3 | 8 | -8 | +----+--------+--------+ | 4 | 9 | -9 | +----+--------+--------+ >>> df.tail(n=2) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 8 | -8 | +----+--------+--------+ | 1 | 9 | -9 | +----+--------+--------+ .. py:method:: to_csv(path: str, index: bool = False, columns: Optional[List[str]] = None, col_delim: str = ',', overwrite: bool = False) Writes DataFrame to CSV file(s). The file(s) will contain a column for each column in the DataFrame. All CSV files written by Arkouda include a header denoting the data types of the columns. Unlike other file formats, CSV files store Strings in their UTF-8 format rather than as uint(8) bytes. :param path: The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk. :type path: str :param index: If True, the index of the DataFrame will be written to the file as a column. :type index: bool, default=False :param columns: Column names to assign when writing data. :type columns: list of str (Optional) :param col_delim: Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset. :type col_delim: str, default="," :param overwrite: If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found. :type overwrite: bool, default=False :rtype: None :raises ValueError: Raised if all datasets are not present in all CSV files or if one or more of the specified files do not exist. :raises RuntimeError: Raised if one or more of the specified files cannot be opened. If `allow_errors` is true this may be raised if no values are returned from the server. :raises TypeError: Raised if we receive an unknown arkouda_type returned from the server. .. rubric:: Notes - CSV format is not currently supported by load/load_all operations. - The column delimiter is expected to be the same for column names and data. - Be sure that column delimiters are not found within your data. - All CSV files must delimit rows using newline ("\\n") at this time. .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import os.path >>> from pathlib import Path >>> my_path = os.path.join(os.getcwd(), 'csv_output') >>> Path(my_path).mkdir(parents=True, exist_ok=True) >>> df = ak.DataFrame({"A":[1,2],"B":[3,4]}) >>> df.to_csv(my_path + "/my_data") >>> df2 = ak.DataFrame.read_csv(my_path + "/my_data" + "_LOCALE0000") >>> display(df2) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 3 | +----+-----+-----+ | 1 | 2 | 4 | +----+-----+-----+
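A hedged variation that also writes the index column and uses a non-default delimiter (choose a delimiter that cannot occur in your data, per the notes above; the '/my_data_tab' prefix is hypothetical): >>> df.to_csv(my_path + "/my_data_tab", index=True, col_delim="\t", overwrite=True)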
.. py:method:: to_hdf(path, index=False, columns=None, file_type='distribute') Save DataFrame to disk as HDF5, preserving column names. :param path: File path to save data. :type path: str :param index: If True, save the index column. By default, do not save the index. :type index: bool, default=False :param columns: List of columns to include in the file. If None, writes out all columns. :type columns: List, default = None :param file_type: Whether to save to a single file or distribute across Locales. :type file_type: str (single | distribute), default=distribute :rtype: None :raises RuntimeError: Raised if a server-side error is thrown saving the pdarray. .. rubric:: Notes This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number. .. seealso:: :obj:`to_parquet`, :obj:`load` .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import os.path >>> from pathlib import Path >>> my_path = os.path.join(os.getcwd(), 'hdf_output') >>> Path(my_path).mkdir(parents=True, exist_ok=True) >>> df = ak.DataFrame({"A":[1,2],"B":[3,4]}) >>> df.to_hdf(my_path + "/my_data") >>> df.load(my_path + "/my_data") +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 3 | +----+-----+-----+ | 1 | 2 | 4 | +----+-----+-----+ .. py:method:: to_markdown(mode='wt', index=True, tablefmt='grid', storage_options=None, **kwargs) Print DataFrame in Markdown-friendly format. :param mode: Mode in which file is opened, "wt" by default. :type mode: str, optional :param index: Add index (row) labels. :type index: bool, optional, default True :param tablefmt: Table format to call from tabulate: https://pypi.org/project/tabulate/ :type tablefmt: str = "grid" :param storage_options: Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set of allowed keys and values. :type storage_options: dict, optional :param \*\*kwargs: These parameters will be passed to tabulate. .. note:: This function should only be called on small DataFrames as it calls pandas.DataFrame.to_markdown: https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.to_markdown.html .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]}) >>> print(df.to_markdown()) +----+------------+------------+ | | animal_1 | animal_2 | +====+============+============+ | 0 | elk | dog | +----+------------+------------+ | 1 | pig | quetzal | +----+------------+------------+ Suppress the index: >>> print(df.to_markdown(index = False)) +------------+------------+ | animal_1 | animal_2 | +============+============+ | elk | dog | +------------+------------+ | pig | quetzal | +------------+------------+
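Because `tablefmt` is forwarded to tabulate, its other table styles should also work; a hedged sketch using tabulate's "pipe" (GitHub-flavored Markdown) format, reusing `df` from the example above: >>> print(df.to_markdown(tablefmt="pipe"))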
.. py:method:: to_pandas(datalimit=1073741824, retain_index=False) Convert this DataFrame to a pandas DataFrame. :param datalimit: The maximum size, in megabytes, to transfer. The DataFrame will be converted to a pandas DataFrame only if its estimated size does not exceed this value. :type datalimit: int, default=arkouda.client.maxTransferBytes :param retain_index: Normally, to_pandas() creates a new range index object. If you want to keep the index column, set this to True. :type retain_index: bool, default=False :returns: The result of converting this DataFrame to a pandas DataFrame. :rtype: pandas.DataFrame .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> ak_df = ak.DataFrame({"A": ak.arange(2), "B": -1 * ak.arange(2)}) >>> type(ak_df) arkouda.dataframe.DataFrame >>> display(ak_df) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 0 | 0 | +----+-----+-----+ | 1 | 1 | -1 | +----+-----+-----+ >>> import pandas as pd >>> pd_df = ak_df.to_pandas() >>> type(pd_df) pandas.core.frame.DataFrame >>> display(pd_df) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 0 | 0 | +----+-----+-----+ | 1 | 1 | -1 | +----+-----+-----+ .. py:method:: to_parquet(path, index=False, columns=None, compression: Optional[str] = None, convert_categoricals: bool = False) Save DataFrame to disk as parquet, preserving column names. :param path: File path to save data. :type path: str :param index: If True, save the index column. By default, do not save the index. :type index: bool, default=False :param columns: List of columns to include in the file. If None, writes out all columns. :type columns: list :param compression: Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4 :type compression: str (Optional), default=None :param convert_categoricals: Parquet requires all columns to be the same size and Categoricals don't satisfy that requirement. If set, write the equivalent Strings in place of any Categorical columns. :type convert_categoricals: bool, default=False :rtype: None :raises RuntimeError: Raised if a server-side error is thrown saving the pdarray. .. rubric:: Notes This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number. .. seealso:: :obj:`to_hdf`, :obj:`load` .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import os.path >>> from pathlib import Path >>> my_path = os.path.join(os.getcwd(), 'parquet_output') >>> Path(my_path).mkdir(parents=True, exist_ok=True) >>> df = ak.DataFrame({"A":[1,2],"B":[3,4]}) >>> df.to_parquet(my_path + "/my_data") >>> df.load(my_path + "/my_data") +----+-----+-----+ | | B | A | +====+=====+=====+ | 0 | 3 | 1 | +----+-----+-----+ | 1 | 4 | 2 | +----+-----+-----+
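A hedged sketch of writing compressed Parquet with one of the supported codecs listed above ("snappy" shown; the '/my_data_snappy' prefix is hypothetical): >>> df.to_parquet(my_path + "/my_data_snappy", compression="snappy")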
.. py:method:: transfer(hostname, port) Sends a DataFrame to a different Arkouda server. :param hostname: The hostname where the Arkouda server intended to receive the DataFrame is running. :type hostname: str :param port: The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open `numLocales` ports in succession starting at `port`, i.e., ports in the range {port..(port+numLocales)} (e.g., running an Arkouda server on 4 nodes with port 1234 passed as `port`, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to `ak.receive_array()`. :type port: int_scalars :returns: A message indicating a complete transfer. :rtype: str :raises ValueError: Raised if the op is not within the pdarray.BinOps set :raises TypeError: Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype .. py:method:: unregister() Unregister this DataFrame object in the arkouda server which was previously registered using register() and/or attached to using attach(). :raises RegistrationError: If the object is already unregistered or if there is a server error when attempting to unregister. .. seealso:: :obj:`register`, :obj:`attach`, :obj:`unregister_dataframe_by_name`, :obj:`is_registered` .. rubric:: Notes Objects registered with the server are immune to deletion until they are unregistered. .. rubric:: Example >>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}) >>> df.register("my_table_name") >>> df.attach("my_table_name") >>> df.is_registered() True >>> df.unregister() >>> df.is_registered() False .. py:method:: unregister_dataframe_by_name(user_defined_name: str) -> str Unregister a DataFrame object by name, where the object was registered with the arkouda server via register(). :param user_defined_name: Name under which the DataFrame object was registered. :type user_defined_name: str :raises TypeError: If user_defined_name is not a string. :raises RegistrationError: If there is an issue attempting to unregister any underlying components. .. seealso:: :obj:`register`, :obj:`unregister`, :obj:`attach`, :obj:`is_registered` .. rubric:: Example >>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}) >>> df.register("my_table_name") >>> df.attach("my_table_name") >>> df.is_registered() True >>> df.unregister_dataframe_by_name("my_table_name") >>> df.is_registered() False .. py:method:: update_hdf(prefix_path: str, index=False, columns=None, repack: bool = True) Overwrite the dataset with the name provided with this dataframe. If the dataset does not exist, it is added. :param prefix_path: Directory and filename prefix that all output files share. :type prefix_path: str :param index: If True, save the index column. By default, do not save the index. :type index: bool, default=False :param columns: List of columns to include in the file. If None, writes out all columns. :type columns: List, default=None :param repack: HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting repack to False will yield better performance, but will cause file sizes to expand. :type repack: bool, default=True :returns: Success message if successful. :rtype: str :raises RuntimeError: Raised if a server-side error is thrown saving the pdarray. .. rubric:: Notes If the file does not contain a File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed. If the dataset provided does not exist, it will be added. .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> import os.path >>> from pathlib import Path >>> my_path = os.path.join(os.getcwd(), 'hdf_output') >>> Path(my_path).mkdir(parents=True, exist_ok=True) >>> df = ak.DataFrame({"A":[1,2],"B":[3,4]}) >>> df.to_hdf(my_path + "/my_data") >>> df.load(my_path + "/my_data") +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 3 | +----+-----+-----+ | 1 | 2 | 4 | +----+-----+-----+ >>> df2 = ak.DataFrame({"A":[5,6],"B":[7,8]}) >>> df2.update_hdf(my_path + "/my_data") >>> df.load(my_path + "/my_data") +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 5 | 7 | +----+-----+-----+ | 1 | 6 | 8 | +----+-----+-----+
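A hedged variation on the update above, rewriting only column "A" and skipping the repack pass (faster, at the cost of file growth, per the `repack` description; output omitted): >>> df2.update_hdf(my_path + "/my_data", columns=["A"], repack=False)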
.. py:method:: update_nrows() Computes the number of rows on the arkouda server and updates the size parameter. .. py:class:: DataFrameGroupBy A DataFrame that has been grouped by a subset of columns. :param gb_key_names: The column name(s) associated with the aggregated columns. :type gb_key_names: str or list(str), default=None :param as_index: If True, interpret aggregated column as index (only implemented for single dimensional aggregates). Otherwise, treat aggregated column as a dataframe column. :type as_index: bool, default=True .. attribute:: gb GroupBy object, where the aggregation keys are values of column(s) of a dataframe, usually in preparation for aggregating with respect to the other columns. :type: arkouda.groupbyclass.GroupBy .. attribute:: df The dataframe containing the original data. :type: arkouda.dataframe.DataFrame .. attribute:: gb_key_names The column name(s) associated with the aggregated columns. :type: str or list(str) .. attribute:: as_index If True, the grouped values of the aggregation keys will be treated as an index. :type: bool, default=True .. py:method:: all(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: any(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: argmax(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: argmin(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: broadcast(x, permute=True) Fill each group’s segment with a constant value. :param x: The values to put in each group’s segment. :type x: Series or pdarray :param permute: If True (default), permute broadcast values back to the ordering of the original array on which GroupBy was called. If False, the broadcast values are grouped by value. :type permute: bool, default=True :returns: A Series with the Index of the original frame and the values of the broadcast. :rtype: arkouda.series.Series .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> from arkouda.dataframe import DataFrameGroupBy >>> df = ak.DataFrame({"A":[1,2,2,3],"B":[3,4,5,6]}) >>> display(df) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 3 | +----+-----+-----+ | 1 | 2 | 4 | +----+-----+-----+ | 2 | 2 | 5 | +----+-----+-----+ | 3 | 3 | 6 | +----+-----+-----+ >>> gb = df.groupby("A") >>> x = ak.array([10,11,12]) >>> s = DataFrameGroupBy.broadcast(gb, x) >>> df["C"] = s.values >>> display(df) +----+-----+-----+-----+ | | A | B | C | +====+=====+=====+=====+ | 0 | 1 | 3 | 10 | +----+-----+-----+-----+ | 1 | 2 | 4 | 11 | +----+-----+-----+-----+ | 2 | 2 | 5 | 11 | +----+-----+-----+-----+ | 3 | 3 | 6 | 12 | +----+-----+-----+-----+ .. py:method:: count(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame
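All of the one-line aggregations above accept the same `colnames` filter; a hedged sketch (outputs omitted) reusing `gb` from the broadcast example, first aggregating over all non-key columns and then restricting to column "B": >>> gb.count() >>> gb.max(colnames="B")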
.. py:method:: diff(colname) Create a difference aggregate for the given column. For each group, the difference between successive values is calculated. Aggregate operations (mean, min, max, std, var) can be done on the results. :param colname: Name of the column to compute the difference on. :type colname: str :returns: Object containing the differences, which can be aggregated. :rtype: DiffAggregate .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({"A":[1,2,2,2,3,3],"B":[3,9,11,27,86,100]}) >>> display(df) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 3 | +----+-----+-----+ | 1 | 2 | 9 | +----+-----+-----+ | 2 | 2 | 11 | +----+-----+-----+ | 3 | 2 | 27 | +----+-----+-----+ | 4 | 3 | 86 | +----+-----+-----+ | 5 | 3 | 100 | +----+-----+-----+ >>> gb = df.groupby("A") >>> gb.diff("B").values array([nan nan 2.00000000000000000 16.00000000000000000 nan 14.00000000000000000]) .. py:method:: first(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: head(n: int = 5, sort_index: bool = True) -> DataFrame Return the first n rows from each group. :param n: Maximum number of rows to return for each group. If the number of rows in a group is less than n, all the values from that group will be returned. :type n: int, optional, default = 5 :param sort_index: If True, return the DataFrame with indices sorted. :type sort_index: bool, default = True :rtype: arkouda.dataframe.DataFrame .. rubric:: Examples >>> import arkouda as ak >>> from arkouda import * >>> df = ak.DataFrame({"a": ak.arange(10) % 3, "b": ak.arange(10)}) >>> display(df) +----+-----+-----+ | | a | b | +====+=====+=====+ | 0 | 0 | 0 | +----+-----+-----+ | 1 | 1 | 1 | +----+-----+-----+ | 2 | 2 | 2 | +----+-----+-----+ | 3 | 0 | 3 | +----+-----+-----+ | 4 | 1 | 4 | +----+-----+-----+ | 5 | 2 | 5 | +----+-----+-----+ | 6 | 0 | 6 | +----+-----+-----+ | 7 | 1 | 7 | +----+-----+-----+ | 8 | 2 | 8 | +----+-----+-----+ | 9 | 0 | 9 | +----+-----+-----+ >>> df.groupby("a").head(2) +----+-----+-----+ | | a | b | +====+=====+=====+ | 0 | 0 | 0 | +----+-----+-----+ | 1 | 0 | 3 | +----+-----+-----+ | 2 | 1 | 1 | +----+-----+-----+ | 3 | 1 | 4 | +----+-----+-----+ | 4 | 2 | 2 | +----+-----+-----+ | 5 | 2 | 5 | +----+-----+-----+
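The differences produced by `diff` support the aggregations listed in its description; a hedged sketch reusing `gb` from the diff example earlier, averaging the successive differences within each group (output omitted): >>> gb.diff("B").mean()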
.. py:method:: max(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: mean(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: median(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: min(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: mode(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: nunique(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: prod(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: sample(n=None, frac=None, replace=False, weights=None, random_state=None) Return a random sample from each group. You can either specify the number of elements or the fraction of elements to be sampled. random_state can be used for reproducibility. :param n: Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None. :type n: int, optional :param frac: Fraction of items to return. Cannot be used with n. :type frac: float, optional :param replace: Allow or disallow sampling of the same row more than once. :type replace: bool, default False :param weights: Default None results in equal probability weighting. If passed a pdarray, then values must have the same length as the underlying DataFrame and will be used as sampling probabilities after normalization within each group. Weights must be non-negative with at least one positive element within each group. :type weights: pdarray, optional :param random_state: If int, seed for random number generator. If ak.random.Generator, use as given. :type random_state: int or ak.random.Generator, optional :returns: A new DataFrame containing items randomly sampled from each group sorted according to the grouped columns. :rtype: DataFrame .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({"A":[3,1,2,1,2,3],"B":[3,4,5,6,7,8]}) >>> display(df) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 3 | 3 | +----+-----+-----+ | 1 | 1 | 4 | +----+-----+-----+ | 2 | 2 | 5 | +----+-----+-----+ | 3 | 1 | 6 | +----+-----+-----+ | 4 | 2 | 7 | +----+-----+-----+ | 5 | 3 | 8 | +----+-----+-----+ >>> df.groupby("A").sample(random_state=6) +----+-----+-----+ | | A | B | +====+=====+=====+ | 3 | 1 | 6 | +----+-----+-----+ | 4 | 2 | 7 | +----+-----+-----+ | 5 | 3 | 8 | +----+-----+-----+ >>> df.groupby("A").sample(frac=0.5, random_state=3, weights=ak.array([1,1,1,0,0,0])) +----+-----+-----+ | | A | B | +====+=====+=====+ | 1 | 1 | 4 | +----+-----+-----+ | 2 | 2 | 5 | +----+-----+-----+ | 0 | 3 | 3 | +----+-----+-----+ >>> df.groupby("A").sample(n=3, replace=True, random_state=ak.random.default_rng(7)) +----+-----+-----+ | | A | B | +====+=====+=====+ | 1 | 1 | 4 | +----+-----+-----+ | 3 | 1 | 6 | +----+-----+-----+ | 1 | 1 | 4 | +----+-----+-----+ | 4 | 2 | 7 | +----+-----+-----+ | 4 | 2 | 7 | +----+-----+-----+ | 4 | 2 | 7 | +----+-----+-----+ | 0 | 3 | 3 | +----+-----+-----+ | 5 | 3 | 8 | +----+-----+-----+ | 5 | 3 | 8 | +----+-----+-----+
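A hedged sketch of reproducible sampling by passing one `ak.random.Generator` as `random_state`, as the parameter description allows (output omitted): >>> rng = ak.random.default_rng(7) >>> df.groupby("A").sample(n=1, random_state=rng)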
.. py:method:: size(as_series=None, sort_index=True) Compute the size of each group as the total number of rows, including NaN values. :param as_series: Indicates whether to return arkouda.dataframe.DataFrame (if as_series = False) or arkouda.series.Series (if as_series = True). :type as_series: bool, default=None :param sort_index: If True, results will be returned with index values sorted in ascending order. :type sort_index: bool, default=True :rtype: arkouda.dataframe.DataFrame or arkouda.series.Series .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> df = ak.DataFrame({"A":[1,2,2,3],"B":[3,4,5,6]}) >>> display(df) +----+-----+-----+ | | A | B | +====+=====+=====+ | 0 | 1 | 3 | +----+-----+-----+ | 1 | 2 | 4 | +----+-----+-----+ | 2 | 2 | 5 | +----+-----+-----+ | 3 | 3 | 6 | +----+-----+-----+ >>> df.groupby("A").size(as_series = False) +----+---------+ | | size | +====+=========+ | 0 | 1 | +----+---------+ | 1 | 2 | +----+---------+ | 2 | 1 | +----+---------+ .. py:method:: std(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: sum(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: tail(n: int = 5, sort_index: bool = True) -> DataFrame Return the last n rows from each group. :param n: Maximum number of rows to return for each group. If the number of rows in a group is less than n, all the rows from that group will be returned. :type n: int, optional, default = 5 :param sort_index: If True, return the DataFrame with indices sorted. :type sort_index: bool, default = True :rtype: arkouda.dataframe.DataFrame .. rubric:: Examples >>> import arkouda as ak >>> from arkouda import * >>> df = ak.DataFrame({"a": ak.arange(10) % 3, "b": ak.arange(10)}) >>> display(df) +----+-----+-----+ | | a | b | +====+=====+=====+ | 0 | 0 | 0 | +----+-----+-----+ | 1 | 1 | 1 | +----+-----+-----+ | 2 | 2 | 2 | +----+-----+-----+ | 3 | 0 | 3 | +----+-----+-----+ | 4 | 1 | 4 | +----+-----+-----+ | 5 | 2 | 5 | +----+-----+-----+ | 6 | 0 | 6 | +----+-----+-----+ | 7 | 1 | 7 | +----+-----+-----+ | 8 | 2 | 8 | +----+-----+-----+ | 9 | 0 | 9 | +----+-----+-----+ >>> df.groupby("a").tail(2) +----+-----+-----+ | | a | b | +====+=====+=====+ | 0 | 0 | 6 | +----+-----+-----+ | 1 | 0 | 9 | +----+-----+-----+ | 2 | 1 | 4 | +----+-----+-----+ | 3 | 1 | 7 | +----+-----+-----+ | 4 | 2 | 5 | +----+-----+-----+ | 5 | 2 | 8 | +----+-----+-----+ .. py:method:: unique(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: var(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame .. py:method:: xor(colnames=None) Aggregate the operation, with the grouped column(s) values as keys. :param colnames: Column name or list of column names to compute the aggregation over. :type colnames: (list of) str, default=None :rtype: arkouda.dataframe.DataFrame
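As a hedged companion to the `size` example above, passing `as_series=True` should return the same counts as an arkouda Series instead of a DataFrame (output omitted): >>> df.groupby("A").size(as_series=True)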
.. py:class:: DiffAggregate A column in a GroupBy that has been differenced. Aggregation operations can be done on the result. .. attribute:: gb GroupBy object, where the aggregation keys are values of column(s) of a dataframe. :type: arkouda.groupbyclass.GroupBy .. attribute:: values A column to compute the difference on. :type: arkouda.series.Series .. py:method:: all() .. py:method:: any() .. py:method:: argmax() .. py:method:: argmin() .. py:method:: count() .. py:method:: first() .. py:method:: max() .. py:method:: mean() .. py:method:: median() .. py:method:: min() .. py:method:: mode() .. py:method:: nunique() .. py:method:: prod() .. py:method:: std() .. py:method:: sum() .. py:method:: unique() .. py:method:: var() .. py:method:: xor() .. py:function:: intersect(a, b, positions=True, unique=False) Find the intersection of two arkouda arrays. This function can be especially useful when `positions=True` so that the caller gets the indices of values present in both arrays. :param a: An array of values (Strings or pdarray). :type a: Strings or pdarray :param b: An array of values (Strings or pdarray). :type b: Strings or pdarray :param positions: Return tuple of boolean pdarrays that indicate positions in `a` and `b` of the intersection values. :type positions: bool, default=True :param unique: If the number of distinct values in `a` (and `b`) is equal to the size of `a` (and `b`), there is a more efficient method to compute the intersection. :type unique: bool, default=False :returns: The indices of `a` and `b` where any element occurs at least once in both arrays. :rtype: (arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray) or arkouda.pdarrayclass.pdarray .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> a = ak.arange(10) >>> print(a) [0 1 2 3 4 5 6 7 8 9] >>> b = 2 * ak.arange(10) >>> print(b) [0 2 4 6 8 10 12 14 16 18] >>> intersect(a,b, positions=True) (array([True False True False True False True False True False]), array([True True True True True False False False False False])) >>> intersect(a,b, positions=False) array([0 2 4 6 8]) .. py:function:: intx(a, b) Find all the rows that are in both dataframes. Columns should be in identical order. Note: this does not work for columns of floating-point values; it does work for Strings and pdarrays of int64 type, and Categorical columns *should* work. .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> a = ak.DataFrame({'a':ak.arange(5),'b': 2* ak.arange(5)}) >>> display(a) +----+-----+-----+ | | a | b | +====+=====+=====+ | 0 | 0 | 0 | +----+-----+-----+ | 1 | 1 | 2 | +----+-----+-----+ | 2 | 2 | 4 | +----+-----+-----+ | 3 | 3 | 6 | +----+-----+-----+ | 4 | 4 | 8 | +----+-----+-----+ >>> b = ak.DataFrame({'a':ak.arange(5),'b':ak.array([0,3,4,7,8])}) >>> display(b) +----+-----+-----+ | | a | b | +====+=====+=====+ | 0 | 0 | 0 | +----+-----+-----+ | 1 | 1 | 3 | +----+-----+-----+ | 2 | 2 | 4 | +----+-----+-----+ | 3 | 3 | 7 | +----+-----+-----+ | 4 | 4 | 8 | +----+-----+-----+ >>> intersect_df = a[intx(a,b)] >>> display(intersect_df) +----+-----+-----+ | | a | b | +====+=====+=====+ | 0 | 0 | 0 | +----+-----+-----+ | 1 | 2 | 4 | +----+-----+-----+ | 2 | 4 | 8 | +----+-----+-----+
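Both arrays in the `intersect` example above contain only distinct values, so the more efficient path described under `unique` can be requested; a hedged sketch expected to match the `positions=False` result shown above: >>> intersect(a, b, positions=False, unique=True)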
.. py:function:: invert_permutation(perm) Find the inverse of a permutation array. :param perm: The permutation array. :type perm: pdarray :returns: The inverse of the permutation array. :rtype: arkouda.pdarrayclass.pdarray .. rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> from arkouda.index import Index >>> i = Index(ak.array([1,2,0,5,4])) >>> perm = i.argsort() >>> print(perm) [2 0 1 4 3] >>> invert_permutation(perm) array([1 2 0 4 3]) .. py:function:: merge(left: DataFrame, right: DataFrame, on: Optional[Union[str, List[str]]] = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) -> DataFrame Merge Arkouda DataFrames with a database-style join. The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the "how" and "on" parameters). Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html :param left: The Left DataFrame to be joined. :type left: DataFrame :param right: The Right DataFrame to be joined. :type right: DataFrame :param on: The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames. :type on: Optional[Union[str, List[str]]] = None :param how: The merge condition. Must be one of "inner", "left", "right", or "outer". :type how: str, default = "inner" :param left_suffix: A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to "_x". Only used when how is "inner". :type left_suffix: str, default = "_x" :param right_suffix: A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to "_y". Only used when how is "inner". :type right_suffix: str, default = "_y" :param convert_ints: If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = "inner". :type convert_ints: bool = True :param sort: If True, DataFrame is returned sorted by "on". Otherwise, the DataFrame is not sorted. :type sort: bool = True :returns: Joined Arkouda DataFrame. :rtype: arkouda.dataframe.DataFrame .. note:: Multiple column joins are only supported for integer columns. ..
rubric:: Examples >>> import arkouda as ak >>> ak.connect() >>> from arkouda import merge >>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)}) >>> display(left_df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 0 | 0 | +----+--------+--------+ | 1 | 1 | -1 | +----+--------+--------+ | 2 | 2 | -2 | +----+--------+--------+ | 3 | 3 | -3 | +----+--------+--------+ | 4 | 4 | -4 | +----+--------+--------+ >>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)}) >>> display(right_df) +----+--------+--------+ | | col1 | col2 | +====+========+========+ | 0 | 0 | 0 | +----+--------+--------+ | 1 | 2 | 2 | +----+--------+--------+ | 2 | 4 | 4 | +----+--------+--------+ | 3 | 6 | 6 | +----+--------+--------+ | 4 | 8 | 8 | +----+--------+--------+ >>> merge(left_df, right_df, on = "col1") +----+--------+----------+----------+ | | col1 | col2_x | col2_y | +====+========+==========+==========+ | 0 | 0 | 0 | 0 | +----+--------+----------+----------+ | 1 | 2 | -2 | 2 | +----+--------+----------+----------+ | 2 | 4 | -4 | 4 | +----+--------+----------+----------+ >>> merge(left_df, right_df, on = "col1", how = "left") +----+--------+----------+----------+ | | col1 | col2_y | col2_x | +====+========+==========+==========+ | 0 | 0 | 0 | 0 | +----+--------+----------+----------+ | 1 | 1 | nan | -1 | +----+--------+----------+----------+ | 2 | 2 | 2 | -2 | +----+--------+----------+----------+ | 3 | 3 | nan | -3 | +----+--------+----------+----------+ | 4 | 4 | 4 | -4 | +----+--------+----------+----------+ >>> merge(left_df, right_df, on = "col1", how = "right") +----+--------+----------+----------+ | | col1 | col2_x | col2_y | +====+========+==========+==========+ | 0 | 0 | 0 | 0 | +----+--------+----------+----------+ | 1 | 2 | -2 | 2 | +----+--------+----------+----------+ | 2 | 4 | -4 | 4 | +----+--------+----------+----------+ | 3 | 6 | nan | 6 | +----+--------+----------+----------+ | 4 | 8 | nan | 8 | +----+--------+----------+----------+ >>> merge(left_df, right_df, on = "col1", how = "outer") +----+--------+----------+----------+ | | col1 | col2_y | col2_x | +====+========+==========+==========+ | 0 | 0 | 0 | 0 | +----+--------+----------+----------+ | 1 | 1 | nan | -1 | +----+--------+----------+----------+ | 2 | 2 | 2 | -2 | +----+--------+----------+----------+ | 3 | 3 | nan | -3 | +----+--------+----------+----------+ | 4 | 4 | 4 | -4 | +----+--------+----------+----------+ | 5 | 6 | 6 | nan | +----+--------+----------+----------+ | 6 | 8 | 8 | nan | +----+--------+----------+----------+
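Since multiple-column joins are supported for integer columns (per the note above), a hedged sketch joining the same frames on both columns, which matches only rows where `col1` and `col2` agree in both DataFrames (output omitted): >>> merge(left_df, right_df, on=["col1", "col2"])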