HDF5

HDF5 is an extremely flexible format. Because of this, it is important to adhere to these specifications in order for a file to be readable by Arkouda.

More information on HDF5 is available here.

File Configuration

Arkouda supports saving HDF5 files in 2 ways:

  • Single File

    • All data is pulled local to the processing root node and saved into one file

  • Distributed Files (Default)

    • Each file contains the portion of the data local to the locale where the file is being written. This results in one file per locale.

It is important to note that the file schemas are the same in both cases.

Supported Arkouda Data Types

While most objects in Arkouda can be saved, there are 3 main datatypes currently supported within HDF5.

  • pdarray

  • Strings

  • ArrayView

  • DataFrame

  • Index

  • Categorical

  • SegArray

  • GroupBy

HDF5 is able to contain any number of objects within the same file.

MetaData Attributes

All data within the HDF5 file is expected to contain several attributes that aide in determining the data within the object. These attributes are assigned at the Group and Dataset levels.

ObjType: int

Integer representing the type of object stored in the group/dataset. This corresponds to the Chapel enum ObjType. Required to properly read each object.

  • 0 = ArrayView

  • 1 = pdarray

  • 2 = Strings

  • 3 = SegArray

  • 4 = Categorical

  • 5 = GroupBy

isBool: int

Integer value (0 or 1) representing a boolean value that indicates if the data stored contains boolean values. This is only required to be set when the dataset contains boolean values.

file_version: real(32) (Optional)

Real value indicating the formatting version. 0.0 and 1.0 are no longer in use. Should be 2.0.

arkouda_version: c_string (Optional)

String value of the Arkouda version at the time the object was written.

The 2 attributes marked Optional are not required for data to be read. Thus, if you are reading data into Arkouda from another source, these can be omitted. However, any dataset written out by Arkodua will include this information.

Additional object types are being worked for direct support.

Data Schema

This section provides an outline of the expected data schema for each object type. Each example assumes the top level group/dataset is not nested.

When reading array values, the data type of the values is automatically detected and is therefore not required to be included in the metadata.

ArrayView

ArrayView objects require additional attributes to be read properly.

Rank: int

Integer representing the number of dimensions in the dataset. This should be stored as the rank of the unflattened data, even when storing as a flattened array.

Shape: int array

Integer array storing the size of each dimension. The array should be of length equal to the Rank.

Providing these attributes allows for the ArrayView object to be reconstructed from its values.

  1. Dataset (will have a user provided name. Defaults to ‘ArrayView’)

    1. Attributes

      1. ObjType: 0

      2. isBool: 0 or 1

      3. Rank: number_of_dimensions

      4. Shape: array_of_size_rank

      5. file_version: 2.0 (Optional)

      6. arkouda_version: ‘current_arkouda_version’ (Optional)

    2. Data - values of the ArrayView.

pdarray

  1. Dataset (will have a user provided name. Defaults to ‘array’)

    1. Attributes

      1. ObjType: 1

      2. isBool: 0 or 1

      3. file_version: 2.0 (Optional)

      4. arkouda_version: ‘current_arkouda_version’ (Optional)

    2. Data - values of the pdarray.

Strings

Strings objects are stored within an HDF5 group. This group contains datasets storing the values and segments separately.

  1. Group (user provided dataset name. Defaults to ‘strings_array’)

    1. Attributes

      1. ObjType: 2

      2. file_version: 2.0 (Optional)

      3. arkouda_version: ‘current_arkouda_version’ (Optional)

    2. Dataset - Values (user provided dataset name with _values appended)

      1. Attributes

        1. ObjType: 1

        2. isBool: 0 or 1

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - uint8 values representing our string values. Includes null byte termination.

    3. Dataset - Offsets (user provided dataset name with _segments appended) (Optional)

      1. Attributes

        1. ObjType: 1

        2. isBool: 0

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - int64 values representing in start index of each string value.

Please Note - The offsets dataset is not required but can be provided. Strings uses null byte termination and is able to calculate the offsets of its components during reads.

SegArray

SegArray objects are stored within an HDF5 group. This group contains datasets storing the values and segments separately.

  1. Group (user provided dataset name. Defaults to ‘segarray’)

    1. Attributes

      1. ObjType: 3

      2. file_version: 2.0 (Optional)

      3. arkouda_version: ‘current_arkouda_version’ (Optional)

    2. Dataset - Values

      1. Attributes

        1. ObjType: 1 or 2

        2. isBool: 0 or 1

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - numeric values representing our string values. int64, uint64, float64, or bool.

    3. Dataset - Offsets

      1. Attributes

        1. ObjType: 1

        2. isBool: 0

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - int64 values representing the start index of each segmented value.

Categorical

Categorical objects are stored within an HDF5 group. This group contains datasets storing the components of the Categorical.

  1. Group (user provided dataset name. Defaults to ‘categorical’)

    1. Attributes

      1. ObjType: 4

      2. file_version: 2.0 (Optional)

      3. arkouda_version: ‘current_arkouda_version’ (Optional)

    2. Dataset - Codes

      1. Attributes

        1. ObjType: 1

        2. isBool: 0

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - int64 values representing our codes of the Categorical.

    3. Dataset - Categories

      1. Attributes

        1. ObjType: 2

        2. isBool: 0

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - Strings group representing the categories of the Categorical.

    4. Dataset - NA_Codes

      1. Attributes

        1. ObjType: 1

        2. isBool: 0

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - int64 values representing the index in of categories with NA value.

    5. Dataset - Permutation (Optional. Only include if Categorical object has permutation property)

      1. Attributes

        1. ObjType: 1

        2. isBool: 0

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - int64 values representing the permutation of the categories.

    6. Dataset - Segments (Optional. Only include if Categorical object has segments property)

      1. Attributes

        1. ObjType: 1

        2. isBool: 0

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - int64 values representing the start index of category segments.

GroupBy

GroupBy objects are stored within an HDF5 group. This group contains datasets storing the components of the GroupBy.

  1. Group (user provided dataset name. Defaults to ‘groupby’)

    1. Attributes

      1. ObjType: 5

      2. file_version: 2.0 (Optional)

      3. arkouda_version: ‘current_arkouda_version’ (Optional)

    2. Dataset - Permutation

      1. Attributes

        1. ObjType: 1

        2. isBool: 0

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - int64 values representing the permutation of the GroupBy.

    3. Dataset - Segments

      1. Attributes

        1. ObjType: 1

        2. isBool: 0

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - int64 values representing the start index of GroupBy segments.

    4. Dataset - unique_key_idx

      1. Attributes

        1. ObjType: 1

        2. isBool: 0

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - int64 values representing the index of Unique keys in the GroupBy.

    5. Dataset - KEY_# (multiple keys may be present. They will be numbered accordingly)

      1. Attributes

        1. ObjType: 1, 2, or 4 (pdarray, Strings, or Categorical)

        2. isBool: 0 or 1

        3. file_version: 2.0 (Optional)

        4. arkouda_version: ‘current_arkouda_version’ (Optional)

      2. Data - Key object used to generate the GroupBy. This will be a dataset or group depending on the object type.

Supported Write Modes

Truncate

When writing to HDF5 in truncate mode, any existing HDF5 file with the same name will be overwritten. If no file exists, one will be created. If writing multiple objects, the first is written in truncate mode. All subsequent objects will then be appended to the file. The user will be notified of any overwritten files.

Append

When writing to HDF5 in append mode, all datasets will be appended to the file. If no file with the supplied name exists, one will be created. If any datasets being written have a name that is already the name of a dataset within the file, an error will be generated.

Data Distribution

Single File

If the user elects to write to a single HDF5 file, all data will be pulled to the processing node and saved to ONE file with the supplied file name. It is important to ensure that the object is small enough to prevent memory exhaustion on the node.

Distributed Files

If the user elects to write data to distributed files, data will be written to one file per locale. Each file will contain the data from the object local to the locale of that file. File names will be the name provided by the user with the suffix _LOCALE#### where #### will be replaced with the locale number. Because the data is distributed across multiple nodes, there is a much lower risk of memory exhaustion.

Legacy File Support

Older version of Arkouda used different schemas for pdarray and Strings objects (ArrayView was not supported). This format does not include the explicit ObjType attribute and requires the type to be inferred during processing. Reading these files is still supported by Arkouda. When the data type is uint8 and the object with the name dataset (user supplied dataset name) is a group containing a dataset name values the object is assumed to be of object type Strings.

API Reference

pdarray

Index

DataFrame

Strings

Categorical

SegArray

GroupBy