arkouda.io
Functions
export | Export data from Arkouda file (Parquet/HDF5) to Pandas object or file formatted to be readable by Pandas.
get_columns | Get a list of column names from CSV file(s).
get_datasets | Get the names of the datasets in the provided files.
get_filetype | Get the type of a file accessible to the server.
get_null_indices | Get null indices of a string column in a Parquet file.
import_data | Import data from a file saved by Pandas (HDF5/Parquet) to Arkouda object and/or a file formatted to be read by Arkouda.
load | Load a pdarray previously saved with pdarray.save().
load_all | Load multiple pdarrays, Strings, SegArrays, or Categoricals previously saved with save_all().
ls | This function calls the h5ls utility on a HDF5 file visible to the arkouda server or calls a function that imitates the result of h5ls on a Parquet file.
ls_csv | Used for identifying the datasets within a file when a CSV does not have a header.
read | Read datasets from files.
read_csv | Read CSV file(s) into Arkouda objects. If more than one dataset is found, the objects will be returned in a dictionary mapping the dataset name to the Arkouda object containing the data.
read_hdf | Read Arkouda objects from HDF5 file/s
read_parquet | Read Arkouda objects from Parquet file/s
read_tagged_data | Read datasets from files and tag each record to the file it was read from.
read_zarr | Reads a Zarr store from disk into a pdarray. Supports multi-dimensional pdarrays of numeric types.
receive | Receive a pdarray sent by pdarray.transfer().
receive_dataframe | Receive a dataframe sent by dataframe.transfer().
restore | Return data saved using ak.snapshot
save_all | DEPRECATED. Save multiple named pdarrays to HDF5/Parquet files.
snapshot | Create a snapshot of the current Arkouda namespace. All currently accessible variables containing Arkouda objects will be written to an HDF5 file.
to_csv | Write Arkouda object(s) to CSV file(s). All CSV Files written by Arkouda include a header denoting data types of the columns.
to_hdf | Save multiple named pdarrays to HDF5 files.
to_parquet | Save multiple named pdarrays to Parquet files.
to_zarr | Writes a pdarray to disk as a Zarr store. Supports multi-dimensional pdarrays of numeric types.
update_hdf | Overwrite the datasets with name appearing in names or keys in columns if columns
Module Contents
- arkouda.io.export(read_path: str, dataset_name: str = 'ak_data', write_file: str | None = None, return_obj: bool = True, index: bool = False)
Export data from Arkouda file (Parquet/HDF5) to Pandas object or file formatted to be readable by Pandas.
- Parameters:
read_path (str) – path to file where arkouda data is stored.
dataset_name (str) – name to store dataset under
index (bool) – Default False. When True, maintain the indexes loaded from the pandas file
write_file (str, optional) – path to file to write pandas formatted data to. Only write the file if this is set
return_obj (bool, optional) – Default True. When True return the Pandas DataFrame object, otherwise return None
- Raises:
RuntimeError – Unsupported file type
- Returns:
When return_obj=True
- Return type:
pd.DataFrame
See also
pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.read_parquet, pandas.read_hdf, ak.import_data
Notes
If an Arkouda file is exported for pandas, the format will not change. This means Parquet files will remain Parquet and HDF5 files will remain HDF5.
Export can only be performed from hdf5 or parquet files written by Arkouda. The result will be the same file type, but formatted to be read by Pandas.
- arkouda.io.get_columns(filenames: str | List[str], col_delim: str = ',', allow_errors: bool = False) -> List[str]
Get a list of column names from CSV file(s).
- arkouda.io.get_datasets(filenames: str | List[str], allow_errors: bool = False, column_delim: str = ',', read_nested: bool = True) -> List[str]
Get the names of the datasets in the provided files.
- Parameters:
filenames (str or List[str]) – Name of the file/s from which to return datasets
allow_errors (bool) – Default False. Whether or not to allow errors while accessing datasets
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Only used for Parquet Files.
- Return type:
List[str] of names of the datasets
- Raises:
RuntimeError – If no datasets are returned
Notes
This function currently supports HDF5 and Parquet formats.
Future updates to Parquet will deprecate this functionality on that format, but similar support will be added for Parquet at that time.
If a list of files is provided, only the datasets in the first file will be returned.
- arkouda.io.get_filetype(filenames: str | List[str]) -> str
Get the type of a file accessible to the server. Supported file types and possible return strings are ‘HDF5’ and ‘Parquet’.
- Parameters:
filenames (Union[str, List[str]]) – A file or list of files visible to the arkouda server
- Returns:
Type of the file returned as a string, either ‘HDF5’, ‘Parquet’ or ‘CSV’
- Return type:
str
- Raises:
ValueError – Raised if filename is empty or contains only whitespace
Notes
When a list is provided, it is assumed that all files are the same type
CSV Files without the Arkouda Header are not supported
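The examples for read in this module note that processing, not the file extension, determines the file type. The following is a minimal sketch of that idea, assuming detection by leading magic bytes (the 8-byte HDF5 signature and Parquet's PAR1 marker). sniff_filetype is a hypothetical helper written for illustration, not Arkouda's implementation; it does not recognise CSV and ignores HDF5 files that begin with a user block.

```python
import os
import tempfile

# Leading bytes defined by the HDF5 and Parquet file formats.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"
PARQUET_MAGIC = b"PAR1"


def sniff_filetype(path: str) -> str:
    """Hypothetical helper: guess a file's type from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(HDF5_MAGIC):
        return "HDF5"
    if head.startswith(PARQUET_MAGIC):
        return "Parquet"
    return "unknown"


# The extension here is deliberately misleading; only the content is inspected.
with tempfile.TemporaryDirectory() as tmp:
    misleading = os.path.join(tmp, "data.h5")
    with open(misleading, "wb") as f:
        f.write(PARQUET_MAGIC + b"...payload...")
    detected = sniff_filetype(misleading)
```

In this sketch a file named data.h5 is still reported as Parquet because its first bytes carry the Parquet marker.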
- arkouda.io.get_null_indices(filenames: str | List[str], datasets: str | List[str] | None = None) -> arkouda.pdarrayclass.pdarray | Mapping[str, arkouda.pdarrayclass.pdarray]
Get null indices of a string column in a Parquet file.
- Parameters:
filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read. Each dataset must be a string column. There is no usable default for this function; the datasets to be read must be specified.
- Returns:
Dictionary of {datasetName: pdarray}
- Return type:
returns a dictionary of Arkouda pdarrays
- Raises:
RuntimeError – Raised if one or more of the specified files cannot be opened.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
- arkouda.io.import_data(read_path: str, write_file: str | None = None, return_obj: bool = True, index: bool = False)
Import data from a file saved by Pandas (HDF5/Parquet) to Arkouda object and/or a file formatted to be read by Arkouda.
- Parameters:
read_path (str) – path to file where pandas data is stored. This can be a glob expression for Parquet formats.
write_file (str, optional) – path to file to write arkouda formatted data to. Only write file if provided
return_obj (bool, optional) – Default True. When True return the Arkouda DataFrame object, otherwise return None
index (bool, optional) – Default False. When True, maintain the indexes loaded from the pandas file
- Raises:
RuntimeWarning – Export attempted on Parquet file. Arkouda formatted Parquet files are readable by pandas.
RuntimeError – Unsupported file type
- Returns:
When return_obj=True
- Return type:
arkouda.dataframe.DataFrame
See also
pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.read_parquet, pandas.read_hdf, ak.export
Notes
Import can only be performed from hdf5 or parquet files written by pandas.
- arkouda.io.load(path_prefix: str, file_format: str = 'INFER', dataset: str = 'array', calc_string_offsets: bool = False, column_delim: str = ',') -> Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index]
Load a pdarray previously saved with pdarray.save().
- Parameters:
path_prefix (str) – Filename prefix used to save the original pdarray
file_format (str) – ‘INFER’, ‘HDF5’ or ‘Parquet’. Defaults to ‘INFER’. Used to indicate the file type being loaded. If INFER, this will be detected during processing
dataset (str) – Dataset name where the pdarray was saved, defaults to ‘array’
calc_string_offsets (bool) – If True the server will ignore Segmented Strings ‘offsets’ array and derive it from the null-byte terminators. Defaults to False currently
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
- Returns:
Dictionary of {datasetName: Union[pdarray, Strings, SegArray, Categorical]} with the previously saved pdarrays, Strings, SegArrays, or Categoricals
- Return type:
Mapping[str, Union[pdarray, Strings, SegArray, Categorical]]
- Raises:
TypeError – Raised if either path_prefix or dataset is not a str
ValueError – Raised if invalid file_format or if the dataset is not present in all hdf5 files or if the path_prefix does not correspond to files accessible to Arkouda
RuntimeError – Raised if the hdf5 files are present but there is an error in opening one or more of them
Notes
If you have a previously saved Parquet file that is raising a FileNotFound error, try loading it with a .parquet appended to the prefix_path. Parquet files were previously ALWAYS stored with a .parquet extension.
ak.load does not support loading a single file. For loading single HDF5 files without the _LOCALE#### suffix please use ak.read().
CSV files without the Arkouda Header are not supported.
Examples
>>> # Loading from file without extension
>>> obj = ak.load('path/prefix')
Loads the array from numLocales files with the name ``cwd/path/name_prefix_LOCALE####``. The file type is inferred during processing.
>>> # Loading with an extension (HDF5)
>>> obj = ak.load('path/prefix.test')
Loads the object from numLocales files with the name ``cwd/path/name_prefix_LOCALE####.test``, where #### is replaced by each locale's number. Because the file type is inferred during processing, the extension is not required to be a specific format.
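To make the per-locale naming in these examples concrete, here is a small sketch that builds the file names a given prefix would correspond to. It assumes that #### is a four-digit, zero-padded locale index starting at 0 and that any extension in the prefix moves after the _LOCALE suffix, as the second example suggests. locale_filenames is a hypothetical helper, not part of Arkouda.

```python
import os
from typing import List


def locale_filenames(path_prefix: str, num_locales: int) -> List[str]:
    """Hypothetical helper: expected per-locale file names for a prefix.

    Assumes a zero-padded, four-digit locale index starting at 0.
    """
    root, ext = os.path.splitext(path_prefix)
    return [f"{root}_LOCALE{i:04d}{ext}" for i in range(num_locales)]


# A prefix with an extension: the suffix is inserted before the extension.
names = locale_filenames("path/prefix.test", 2)
# A prefix without an extension: the suffix is simply appended.
bare = locale_filenames("path/prefix", 1)
```

A helper like this can be handy for checking that all expected files are present before calling ak.load.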
- arkouda.io.load_all(path_prefix: str, file_format: str = 'INFER', column_delim: str = ',', read_nested=True) -> Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical]
Load multiple pdarrays, Strings, SegArrays, or Categoricals previously saved with save_all().
- Parameters:
path_prefix (str) – Filename prefix used to save the original pdarray
file_format (str) – ‘INFER’, ‘HDF5’, ‘Parquet’, or ‘CSV’. Defaults to ‘INFER’. Indicates the format being loaded. When ‘INFER’, the format is detected during processing.
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Parquet files only
- Returns:
Dictionary of {datasetName: Union[pdarray, Strings, SegArray, Categorical]} with the previously saved pdarrays, Strings, SegArrays, or Categoricals
- Return type:
Mapping[str, Union[pdarray, Strings, SegArray, Categorical]]
- Raises:
TypeError – Raised if path_prefix is not a str
ValueError – Raised if file_format/extension is encountered that is not hdf5 or parquet or if all datasets are not present in all hdf5/parquet files or if the path_prefix does not correspond to files accessible to Arkouda
RuntimeError – Raised if the hdf5 files are present but there is an error in opening one or more of them
See also
to_parquet, to_hdf, load, read
Notes
This function has been updated to determine the file extension based on the file format variable
This function will be deprecated when glob flags are added to read_* methods
CSV files without the Arkouda Header are not supported.
- arkouda.io.ls(filename: str, col_delim: str = ',', read_nested: bool = True) -> List[str]
This function calls the h5ls utility on a HDF5 file visible to the arkouda server or calls a function that imitates the result of h5ls on a Parquet file.
- Parameters:
filename (str) – The name of the file to pass to the server
col_delim (str) – The delimiter used to separate columns if the file is a csv
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Only used for Parquet files.
- Returns:
The string output of the datasets from the server
- Return type:
str
- Raises:
TypeError – Raised if filename is not a str
ValueError – Raised if filename is empty or contains only whitespace
RuntimeError – Raised if error occurs in executing ls on an HDF5 file
Notes
This will need to be updated because Parquet will not technically support this when we update. Similar functionality will be added for Parquet in the future.
For CSV files without headers, please use ls_csv.
- arkouda.io.ls_csv(filename: str, col_delim: str = ',') -> List[str]
Used for identifying the datasets within a file when a CSV does not have a header.
- Parameters:
filename (str) – The name of the file to pass to the server
col_delim (str) – The delimiter used to separate columns if the file is a csv
- Returns:
The string output of the datasets from the server
- Return type:
str
- arkouda.io.read(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strictTypes: bool = True, allow_errors: bool = False, calc_string_offsets=False, column_delim: str = ',', read_nested: bool = True, has_non_float_nulls: bool = False, fixed_len: int = -1) -> Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index]
Read datasets from files. File Type is determined automatically.
- Parameters:
filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the offsets/segments array on the server versus loading them from HDF5 files. In the future this option may be set to True as the default.
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True. When True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Ignored if datasets is not None. Parquet files only.
has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns that contain null values.
fixed_len (int) – Default -1. This value can be set for reading Parquet string columns when the length of each string is known at runtime. This can allow for skipping byte calculation, which can have an impact on performance.
- Returns:
Dictionary of {datasetName: pdarray, String, or SegArray}
- Return type:
Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.
- Raises:
RuntimeError – If invalid filetype is detected
Notes
If filenames is a string, it is interpreted as a shell expression (a single filename is a valid expression, so it will work) and is expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to the server as independent sequential strings while if iterative == False all dataset names and file names are passed to the server in a single string.
If datasets is None, infer the names of datasets from the first file and read all of them. Use get_datasets to show the names of datasets in HDF5/Parquet files.
CSV files without the Arkouda Header are not supported.
Examples
>>> # Read with file extension
>>> x = ak.read('path/name_prefix.h5') # load HDF5 - processing determines file type, not extension
>>> # Read a Parquet file
>>> x = ak.read('path/name_prefix.parquet') # load Parquet
>>> # Read glob expression
>>> x = ak.read('path/name_prefix*') # Reads HDF5
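The Notes for read say a string filenames argument is treated as a shell expression and expanded with glob. The sketch below illustrates only those matching semantics, using Python's standard glob module on a temporary local directory. It is not Arkouda code, the file names are made up, and where Arkouda performs the expansion is not shown here.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a few empty files to match against (names are illustrative only).
    for name in ("name_prefix_LOCALE0000", "name_prefix_LOCALE0001", "other_file"):
        open(os.path.join(tmp, name), "w").close()

    # A wildcard expression expands to every matching file.
    matches = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(tmp, "name_prefix*"))
    )

    # A plain filename is also a valid expression and matches just itself.
    single = [os.path.basename(p) for p in glob.glob(os.path.join(tmp, "other_file"))]
```

This is why a single path and a wildcard pattern can both be passed as a plain string.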
- arkouda.io.read_csv(filenames: str | List[str], datasets: str | List[str] | None = None, column_delim: str = ',', allow_errors: bool = False) -> Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index]
Read CSV file(s) into Arkouda objects. If more than one dataset is found, the objects will be returned in a dictionary mapping the dataset name to the Arkouda object containing the data. If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as a Strings object.
- Parameters:
filenames (str or List[str]) – The filenames to read data from
datasets (str or List[str] (Optional)) – names of the datasets to read. When None, all datasets will be read.
column_delim (str) – The delimiter for column names and data. Defaults to “,”.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
- Returns:
Dictionary of {datasetName: pdarray, String, or SegArray}
- Return type:
Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.
- Raises:
ValueError – Raised if all datasets are not present in all CSV files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
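Because the Notes warn that the column delimiter must not occur inside the data and that rows are newline-delimited, it can help to check values before writing. The following is a minimal sketch of such a pre-check on plain Python lists. columns_containing_delim is a hypothetical helper and the sample data is invented for illustration.

```python
from typing import Dict, List, Sequence


def columns_containing_delim(
    columns: Dict[str, Sequence[object]], col_delim: str = ","
) -> List[str]:
    """Hypothetical pre-check: names of columns in which some value
    contains the column delimiter or a newline."""
    bad = []
    for name, values in columns.items():
        if any(col_delim in str(v) or "\n" in str(v) for v in values):
            bad.append(name)
    return bad


# Invented sample data: one value contains a comma.
data = {"city": ["Paris", "Washington, D.C."], "count": [3, 7]}
conflicts = columns_containing_delim(data, ",")
safe_with_pipe = columns_containing_delim(data, "|")
```

If a conflict is reported, choosing a different delimiter for both writing and reading avoids the problem.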
- arkouda.io.read_hdf(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strict_types: bool = True, allow_errors: bool = False, calc_string_offsets: bool = False, tag_data=False) -> Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index]
Read Arkouda objects from HDF5 file/s
- Parameters:
filenames (str, List[str]) – Filename/s to read objects from
datasets (Optional str, List[str]) – datasets to read from the provided files
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strict_types (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the offsets/segments array on the server versus loading them from HDF5 files. In the future this option may be set to True as the default.
tag_data (bool) – Default False, if True tag the data with the code associated with the filename that the data was pulled from.
- Returns:
Dictionary of {datasetName: pdarray, String, SegArray}
- Return type:
Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.
- Raises:
ValueError – Raised if all datasets are not present in all hdf5 files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
If filenames is a string, it is interpreted as a shell expression (a single filename is a valid expression, so it will work) and is expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to the server as independent sequential strings while if iterative == False all dataset names and file names are passed to the server in a single string.
If datasets is None, infer the names of datasets from the first file and read all of them. Use get_datasets to show the names of datasets in HDF5 files.
Examples
>>> # Read with file extension
>>> x = ak.read_hdf('path/name_prefix.h5') # load HDF5
>>> # Read glob expression
>>> x = ak.read_hdf('path/name_prefix*') # Reads HDF5
- arkouda.io.read_parquet(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strict_types: bool = True, allow_errors: bool = False, tag_data: bool = False, read_nested: bool = True, has_non_float_nulls: bool = False, fixed_len: int = -1) -> Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index]
Read Arkouda objects from Parquet file/s
- Parameters:
filenames (str, List[str]) – Filename/s to read objects from
datasets (Optional str, List[str]) – datasets to read from the provided files
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strict_types (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
tag_data (bool) – Default False, if True tag the data with the code associated with the filename that the data was pulled from.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. If datasets is not None, this will be ignored.
has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns that contain null values.
fixed_len (int) – Default -1. This value can be set for reading Parquet string columns when the length of each string is known at runtime. This can allow for skipping byte calculation, which can have an impact on performance.
- Returns:
Dictionary of {datasetName: pdarray, String, or SegArray}
- Return type:
Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
If filenames is a string, it is interpreted as a shell expression (a single filename is a valid expression, so it will work) and is expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to the server as independent sequential strings while if iterative == False all dataset names and file names are passed to the server in a single string.
If datasets is None, infer the names of datasets from the first file and read all of them. Use get_datasets to show the names of datasets in Parquet files.
Parquet always recomputes offsets at this time. This will need to be updated once the Parquet workflow is updated.
Examples
>>> # Read with file extension
>>> x = ak.read_parquet('path/name_prefix.parquet') # load Parquet
>>> # Read glob expression
>>> x = ak.read_parquet('path/name_prefix*') # Reads Parquet
- arkouda.io.read_tagged_data(filenames: str | List[str], datasets: str | List[str] | None = None, strictTypes: bool = True, allow_errors: bool = False, calc_string_offsets=False, read_nested: bool = True, has_non_float_nulls: bool = False)
Read datasets from files and tag each record to the file it was read from. File Type is determined automatically.
- Parameters:
filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)
strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the offsets/segments array on the server versus loading them from HDF5 files. In the future this option may be set to True as the default.
read_nested (bool) – Default True. When True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Ignored if datasets is not None. Parquet files only.
has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns that contain null values.
Notes
Not currently supported for Categorical or GroupBy datasets
Examples
Read files and return data with tagging corresponding to the Categorical returned. cat.codes will link the codes in data to the filename. Data will contain the code Filename_Codes.
>>> data, cat = ak.read_tagged_data('path/name')
>>> data
{'Filename_Codes': array([0 3 6 9 12]), 'col_name': array([0 0 0 1])}
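The tagging idea can be sketched without a server: every record gets an integer code, and the code indexes into the list of source files. The example below uses plain Python lists in place of pdarrays and a Categorical. tag_records and the file names are hypothetical, and the column name col_name is borrowed from the example above.

```python
from typing import Dict, List, Tuple


def tag_records(per_file: Dict[str, List[int]]) -> Tuple[Dict[str, List[int]], List[str]]:
    """Hypothetical illustration: concatenate records from several files and
    record, for each one, the index of the file it came from."""
    filenames = list(per_file)
    codes: List[int] = []
    values: List[int] = []
    for code, fname in enumerate(filenames):
        records = per_file[fname]
        codes.extend([code] * len(records))
        values.extend(records)
    return {"Filename_Codes": codes, "col_name": values}, filenames


# Two invented files with two and one records respectively.
data, files = tag_records({"a.h5": [10, 11], "b.h5": [12]})
# Look up the source file of the last record via its code.
source_of_last = files[data["Filename_Codes"][-1]]
```

Here the list of file names plays the role that the returned Categorical plays in read_tagged_data.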
- arkouda.io.read_zarr(store_path: str, ndim: int, dtype)
Reads a Zarr store from disk into a pdarray. Supports multi-dimensional pdarrays of numeric types. To use this function, ensure you have installed the blosc dependency (make install-blosc) and have included ZarrMsg.chpl in the ServerModules.cfg file.
- Parameters:
store_path (str) – The path to the Zarr store. The path must be to a directory that contains a .zarray file containing the Zarr store metadata.
ndim (int) – The number of dimensions in the array
dtype (str) – The data type of the array
- Returns:
The pdarray read from the Zarr store.
- Return type:
arkouda.pdarrayclass.pdarray
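Since read_zarr needs ndim and dtype up front, and the store directory must hold a .zarray metadata file, one way to find those values is to read that file directly. The sketch below assumes the Zarr v2 layout, where .zarray is a JSON document with shape and dtype keys. zarr_ndim_and_dtype is a hypothetical helper, and the Zarr-style dtype string it returns (for example "<i8") may need translating into whatever form read_zarr expects, which is not verified here.

```python
import json
import os
import tempfile
from typing import Tuple


def zarr_ndim_and_dtype(store_path: str) -> Tuple[int, str]:
    """Hypothetical helper: read ndim and the Zarr dtype string from .zarray."""
    meta_path = os.path.join(store_path, ".zarray")
    if not os.path.isfile(meta_path):
        raise FileNotFoundError(f"{store_path} has no .zarray metadata file")
    with open(meta_path) as f:
        meta = json.load(f)
    return len(meta["shape"]), meta["dtype"]


# Build a throwaway store directory containing only the metadata file.
with tempfile.TemporaryDirectory() as store:
    with open(os.path.join(store, ".zarray"), "w") as f:
        json.dump(
            {"zarr_format": 2, "shape": [4, 6], "chunks": [2, 3], "dtype": "<i8"}, f
        )
    ndim, dtype = zarr_ndim_and_dtype(store)
```

The number of dimensions is simply the length of the shape list in the metadata.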
- arkouda.io.receive(hostname: str, port)
Receive a pdarray sent by pdarray.transfer().
- Parameters:
hostname (str) – The hostname of the pdarray that sent the array
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports in succession, using the range {port..(port+numLocales)} (e.g., when running an Arkouda server of 4 nodes with 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to pdarray.transfer().
- Returns:
The pdarray sent from the sending server to the current receiving server.
- Return type:
arkouda.pdarrayclass.pdarray
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
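The port arithmetic described for the port parameter can be written out as a short sketch. It follows the worked example in the text (4 locales starting at 1234 use ports 1234 through 1237), so the last port used is port + numLocales - 1. transfer_ports is a hypothetical helper, useful for checking in advance which ports need to be free.

```python
from typing import List


def transfer_ports(port: int, num_locales: int) -> List[int]:
    """Hypothetical helper: the consecutive ports used by a transfer,
    one per locale, starting at the port passed in."""
    return list(range(port, port + num_locales))


# Matches the example in the parameter description: 4 locales, base port 1234.
ports = transfer_ports(1234, 4)
```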
- arkouda.io.receive_dataframe(hostname: str, port)
Receive a dataframe sent by dataframe.transfer().
- Parameters:
hostname (str) – The hostname of the dataframe that sent the array
port (int_scalars) – The port to send the dataframe over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports in succession, using the range {port..(port+numLocales)} (e.g., when running an Arkouda server of 4 nodes with 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to dataframe.transfer().
- Returns:
The dataframe sent from the sending server to the current receiving server.
- Return type:
arkouda.dataframe.DataFrame
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- arkouda.io.restore(filename)
Return data saved using ak.snapshot
- Parameters:
filename (str) – Name used to create snapshot to be read
- Return type:
Dict
Notes
Unlike other save/load methods, snapshot and restore save DataFrames alongside other objects in HDF5. Thus, each one is returned within the dictionary as a DataFrame.
- arkouda.io.save_all(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray], prefix_path: str, names: List[str] | None = None, file_format='HDF5', mode: str = 'truncate', file_type: str = 'distribute', compression: str | None = None) -> None
DEPRECATED. Save multiple named pdarrays to HDF5/Parquet files.
- Parameters:
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
file_format (str) – ‘HDF5’ or ‘Parquet’. Defaults to ‘HDF5’
mode (str) – By default, truncate (overwrite) the output files if they exist. If ‘append’, attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: distribute. Single writes the dataset to a single file. Distribute writes the dataset to a file per locale. Only used with HDF5.
compression (str (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4")) – Optional. Select the compression to use with Parquet files. Only used with Parquet.
- Return type:
None
- Raises:
ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode is not ‘truncate’ or ‘append’
See also
save, load_all, to_parquet, to_hdf
Notes
Creates one file per locale containing that locale’s chunk of each pdarray. If columns is a dictionary, the keys are used as the HDF5 dataset names. Otherwise, if no names are supplied, 0-up integers are used. By default, any existing files at prefix_path will be overwritten, unless the user specifies the ‘append’ mode, in which case arkouda will attempt to add <columns> as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.
Examples
>>> a = ak.arange(25)
>>> b = ak.arange(25)
>>> # Save with mapping defining dataset names
>>> ak.save_all({'a': a, 'b': b}, 'path/name_prefix', file_format='Parquet')
>>> # Save using names instead of mapping
>>> ak.save_all([a, b], 'path/name_prefix', names=['a', 'b'], file_format='Parquet')
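The naming rules in the Notes (dictionary keys win, an explicit names list comes next, otherwise 0-up integers) can be sketched as a small standalone function. resolve_dataset_names is a hypothetical helper rather than Arkouda's code. Rendering the integer names as strings and the wording of the error message are assumptions.

```python
from typing import List, Mapping, Optional, Sequence, Union


def resolve_dataset_names(
    columns: Union[Mapping[str, object], Sequence[object]],
    names: Optional[Sequence[str]] = None,
) -> List[str]:
    """Hypothetical helper mirroring the dataset-naming rules described above."""
    if isinstance(columns, Mapping):
        # Dictionary input: the keys are the dataset names.
        return list(columns.keys())
    if names is None:
        # List input without names: fall back to 0-up integers (as strings here).
        return [str(i) for i in range(len(columns))]
    if len(names) != len(columns):
        raise ValueError("Number of names does not match number of columns")
    return list(names)


from_mapping = resolve_dataset_names({"a": [1], "b": [2]})
from_default = resolve_dataset_names([[1], [2]])
from_names = resolve_dataset_names([[1], [2]], names=["x", "y"])
```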
- arkouda.io.snapshot(filename)[source]¶
Create a snapshot of the current Arkouda namespace. All currently accessible variables containing Arkouda objects will be written to an HDF5 file.
Unlike other save/load functions, this maintains the integrity of dataframes.
Current variable names are used as the dataset names when saving.
- Parameters:
filename (str) – Name to use when storing file
- Return type:
None
See also
ak.restore
- arkouda.io.to_csv(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], prefix_path: str, names: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)[source]¶
Write Arkouda object(s) to CSV file(s). All CSV files written by Arkouda include a header denoting data types of the columns.
- Parameters:
columns (Mapping[str, pdarray] or List[pdarray]) – The objects to be written to CSV file. If a mapping is used and names is None the keys of the mapping will be used as the dataset names.
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
names (List[str] (Optional)) – names of dataset to be written. Order should correspond to the order of data provided in columns.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
None
- Raises:
ValueError – Raised if any datasets are present in all csv files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
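The notes above ask that the column delimiter never appear in the data. A minimal client-side sketch of such a check follows; find_delimiter_conflicts is a hypothetical helper that works on plain Python strings (for example a small sample pulled from a Strings column), and treating embedded newlines as conflicts is an assumption drawn from the row-delimiter note.

```python
from typing import Iterable, List


def find_delimiter_conflicts(values: Iterable[str], col_delim: str = ",") -> List[int]:
    """Return the positions of values that would break a delimited row.

    Hypothetical helper, not part of Arkouda. A value conflicts when it
    contains the column delimiter or (by assumption) a newline, since
    rows are delimited by newlines.
    """
    return [i for i, v in enumerate(values) if col_delim in v or "\n" in v]


sample = ["alpha", "beta,gamma", "delta"]
conflicts = find_delimiter_conflicts(sample, col_delim=",")
print(conflicts)  # → [1]
```

If conflicts are found, passing a different col_delim to to_csv is one way to avoid them.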
- arkouda.io.to_hdf(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray], prefix_path: str, names: List[str] | None = None, mode: str = 'truncate', file_type: str = 'distribute') None [source]¶
Save multiple named pdarrays to HDF5 files.
- Parameters:
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
mode ({'truncate' | 'append'}) – By default, truncate (overwrite) the output files if they exist. If ‘append’, attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: distribute. "single" writes the dataset to a single file; "distribute" writes the dataset to one file per locale.
- Return type:
None
- Raises:
ValueError – Raised if (1) the lengths of columns and names differ or (2) the mode is not ‘truncate’ or ‘append’
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Creates one file per locale containing that locale’s chunk of each pdarray. If columns is a dictionary, the keys are used as the HDF5 dataset names. Otherwise, if no names are supplied, 0-up integers are used. By default, any existing files at prefix_path will be overwritten, unless the user specifies the ‘append’ mode, in which case arkouda will attempt to add the columns as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.
Examples
>>> a = ak.arange(25)
>>> b = ak.arange(25)
>>> # Save with mapping defining dataset names
>>> ak.to_hdf({'a': a, 'b': b}, 'path/name_prefix')
>>> # Save using names instead of mapping
>>> ak.to_hdf([a, b], 'path/name_prefix', names=['a', 'b'])
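The naming rules in the notes (dictionary keys, explicit names, or 0-up integers) can be sketched in plain Python. resolve_dataset_names is a hypothetical helper and not Arkouda's implementation; letting explicit names override dictionary keys and rendering the 0-up integers as strings are assumptions.

```python
import collections.abc
from typing import List, Mapping, Optional, Sequence, Tuple, Union


def resolve_dataset_names(
    columns: Union[Mapping[str, object], Sequence[object]],
    names: Optional[Sequence[str]] = None,
) -> Tuple[List[object], List[str]]:
    """Pair each column with a dataset name following the documented rules."""
    if isinstance(columns, collections.abc.Mapping):
        data = list(columns.values())
        if names is None:
            # Dictionary keys become the dataset names.
            names = list(columns.keys())
    else:
        data = list(columns)
        if names is None:
            # No names supplied: fall back to 0-up integers (as strings here).
            names = [str(i) for i in range(len(data))]
    if len(data) != len(names):
        # Mirrors documented ValueError condition (1).
        raise ValueError("columns and names must have the same length")
    return data, list(names)


print(resolve_dataset_names({"a": 1, "b": 2}))
print(resolve_dataset_names([10, 20]))
```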
- arkouda.io.to_parquet(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray], prefix_path: str, names: List[str] | None = None, mode: str = 'truncate', compression: str | None = None, convert_categoricals: bool = False) None [source]¶
Save multiple named pdarrays to Parquet files.
- Parameters:
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
mode ({'truncate' | 'append'}) – By default, truncate (overwrite) the output files if they exist. If ‘append’, attempt to create new dataset in existing files. ‘append’ is deprecated; please use the multi-column write instead.
compression (str (Optional)) – Default None. Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4
convert_categoricals (bool) – Defaults to False. Parquet requires all columns to be the same size and Categoricals don’t satisfy that requirement. If set, write the equivalent Strings in place of any Categorical columns.
- Return type:
None
- Raises:
ValueError – Raised if (1) the lengths of columns and names differ or (2) the mode is not ‘truncate’ or ‘append’
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Creates one file per locale containing that locale’s chunk of each pdarray. If columns is a dictionary, the keys are used as the Parquet column names. Otherwise, if no names are supplied, 0-up integers are used. By default, any existing files at prefix_path will be overwritten, unless the user specifies the ‘append’ mode, in which case arkouda will attempt to add the columns as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.
Examples
>>> a = ak.arange(25)
>>> b = ak.arange(25)
>>> # Save with mapping defining dataset names
>>> ak.to_parquet({'a': a, 'b': b}, 'path/name_prefix')
>>> # Save using names instead of mapping
>>> ak.to_parquet([a, b], 'path/name_prefix', names=['a', 'b'])
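Option values can be checked on the client before a write is attempted. The following is a sketch under stated assumptions: check_parquet_options is a hypothetical helper, the allowed values are copied from the parameter list above, and raising ValueError for an unknown compression is an assumption (only the mode error is documented).

```python
from typing import Optional

# Values taken from the documented parameter descriptions.
SUPPORTED_MODES = {"truncate", "append"}
SUPPORTED_COMPRESSION = {"snappy", "gzip", "brotli", "zstd", "lz4"}


def check_parquet_options(mode: str = "truncate", compression: Optional[str] = None) -> None:
    """Raise ValueError if an option is outside the documented values."""
    if mode not in SUPPORTED_MODES:
        # Mirrors documented ValueError condition (2).
        raise ValueError(f"mode must be 'truncate' or 'append', got {mode!r}")
    if compression is not None and compression not in SUPPORTED_COMPRESSION:
        # Assumption: the real function may report this differently.
        raise ValueError(f"unsupported compression: {compression!r}")


check_parquet_options(mode="truncate", compression="zstd")  # passes silently
```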
- arkouda.io.to_zarr(store_path: str, arr: arkouda.pdarrayclass.pdarray, chunk_shape)[source]¶
Writes a pdarray to disk as a Zarr store. Supports multi-dimensional pdarrays of numeric types. To use this function, ensure you have installed the blosc dependency (make install-blosc) and have included ZarrMsg.chpl in the ServerModules.cfg file.
- Parameters:
store_path (str) – The path at which Zarr store should be written
arr (pdarray) – The pdarray to be written to disk
chunk_shape (tuple) – The shape of the chunks to be used in the Zarr store
- Raises:
ValueError – Raised if the number of dimensions in the chunk shape does not match the number of dimensions in the array, or if the array is not a 32- or 64-bit numeric type
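To illustrate how chunk_shape relates to the array shape, the sketch below checks the dimension rule stated under Raises and counts the chunks along each axis. chunks_per_dimension is a hypothetical helper operating on plain tuples; the ceiling division assumes that partial chunks at the array edge are stored as full chunks, as is usual for Zarr, and the positive-size check is also an assumption.

```python
from typing import Tuple


def chunks_per_dimension(
    array_shape: Tuple[int, ...], chunk_shape: Tuple[int, ...]
) -> Tuple[int, ...]:
    """Return how many chunks cover each dimension of the array."""
    if len(array_shape) != len(chunk_shape):
        # Same condition the documentation lists under Raises.
        raise ValueError("chunk_shape must have one entry per array dimension")
    if any(c <= 0 for c in chunk_shape):
        raise ValueError("chunk sizes must be positive")  # assumption
    # Integer ceiling division: a partial chunk at the edge still counts.
    return tuple((n + c - 1) // c for n, c in zip(array_shape, chunk_shape))


print(chunks_per_dimension((100, 50), (32, 32)))  # → (4, 2)
```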
- arkouda.io.update_hdf(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray], prefix_path: str, names: List[str] | None = None, repack: bool = True)[source]¶
Overwrite the datasets whose names appear in names, or in the keys of columns if columns is a dictionary
- Parameters:
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
repack (bool) – Default: True. HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to False will yield better performance, but will cause file sizes to expand.
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the datasets
Notes
If a file does not contain the File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the datasets provided do not exist, they will be added
Because HDF5 deletes do not release memory, this will create a copy of the file with the new data
This workflow is slightly different from to_hdf to prevent reading and creating a copy of the file for each dataset
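The file-name fallback described in the first note can be pictured with a small pattern match. This is only a sketch: looks_distributed is a hypothetical helper, and reading #### as exactly four decimal digits is an assumption about the naming scheme rather than something the notes state.

```python
import re

# Assumption: "####" in the notes stands for a four-digit locale index.
LOCALE_SUFFIX = re.compile(r"_LOCALE\d{4}")


def looks_distributed(filename: str) -> bool:
    """Guess from the name alone whether a file is one piece of a per-locale set."""
    return LOCALE_SUFFIX.search(filename) is not None


print(looks_distributed("path/name_prefix_LOCALE0000"))  # → True
print(looks_distributed("path/name_prefix"))             # → False
```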