HDF5¶
HDF5 is an extremely flexible format. Because of this, it is important to adhere to these specifications in order for a file to be readable by Arkouda.
More information on HDF5 is available here.
File Configuration¶
Arkouda supports saving HDF5 files in 2 ways:
Single File
All data is pulled to the root processing node and saved into one file.
Distributed Files (Default)
Each file contains the portion of the data local to the locale where the file is being written. This results in one file per locale.
It is important to note that the file schemas are the same in both cases.
Supported Arkouda Data Types¶
While most objects in Arkouda can be saved, the following data types are currently supported within HDF5:
pdarray
Strings
ArrayView
DataFrame
Index
Categorical
SegArray
GroupBy
HDF5 is able to contain any number of objects within the same file.
MetaData Attributes¶
All data within the HDF5 file is expected to contain several attributes that aid in determining the data within the object. These attributes are assigned at the Group and Dataset levels.
ObjType: int
Integer representing the type of object stored in the group/dataset. This corresponds to the Chapel enum ObjType. Required to properly read each object.
0 = ArrayView
1 = pdarray
2 = Strings
3 = SegArray
4 = Categorical
5 = GroupBy
isBool: int
Integer value (0 or 1) indicating whether the stored data contains boolean values. This is only required to be set when the dataset contains boolean values.
file_version: real(32) (Optional)
Real value indicating the formatting version. 0.0 and 1.0 are no longer in use. Should be 2.0.
arkouda_version: c_string (Optional)
String value of the Arkouda version at the time the object was written.
The 2 attributes marked Optional are not required for data to be read. Thus, if you are reading data into Arkouda from another source, these can be omitted. However, any dataset written out by Arkouda will include this information.
Direct support for additional object types is in development.
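The attribute rules above can be sketched in plain Python. This is only an illustration: the `ObjType` class mirrors the integer codes listed in this section, and `describe_attrs` is a hypothetical helper that takes an ordinary dict standing in for the attributes of an HDF5 group or dataset. It is not part of the Arkouda API.

```python
from enum import IntEnum


class ObjType(IntEnum):
    """Python mirror of the ObjType codes listed above (illustrative only)."""
    ARRAYVIEW = 0
    PDARRAY = 1
    STRINGS = 2
    SEGARRAY = 3
    CATEGORICAL = 4
    GROUPBY = 5


def describe_attrs(attrs):
    """Return (object type, is_bool) from a dict of metadata attributes.

    `attrs` is a plain dict standing in for HDF5 attributes (an assumption
    made so the sketch runs without an HDF5 library).
    """
    if "ObjType" not in attrs:
        raise KeyError("ObjType is required to read the object")
    obj_type = ObjType(attrs["ObjType"])
    # isBool only needs to be set when the data holds boolean values,
    # so a missing value is treated as 0 here.
    is_bool = bool(attrs.get("isBool", 0))
    # file_version and arkouda_version are optional and are not consulted.
    return obj_type, is_bool


print(describe_attrs({"ObjType": 1, "isBool": 1}))
```

An unknown integer code raises `ValueError` from the enum constructor, which is one reasonable way to reject object types that are not listed above.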
Data Schema¶
This section provides an outline of the expected data schema for each object type. Each example assumes the top level group/dataset is not nested.
When reading array values, the data type of the values is automatically detected and is therefore not required to be included in the metadata.
ArrayView¶
ArrayView objects require additional attributes to be read properly.
Rank: int
Integer representing the number of dimensions in the dataset. This should be stored as the rank of the unflattened data, even when storing as a flattened array.
Shape: int array
Integer array storing the size of each dimension. The array should be of length equal to the Rank.
Providing these attributes allows for the ArrayView object to be reconstructed from its values.
Dataset (will have a user provided name. Defaults to ‘ArrayView’)
Attributes
ObjType: 0
isBool: 0 or 1
Rank: number_of_dimensions
Shape: array_of_size_rank
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - values of the ArrayView.
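To illustrate how Rank and Shape allow the object to be rebuilt from flattened values, here is a minimal sketch. It assumes the values were flattened in row-major (C) order, which this section does not state, and `unflatten` is a hypothetical helper name operating on plain Python lists.

```python
from math import prod


def unflatten(values, shape):
    """Rebuild nested lists of the given shape from flat values.

    Assumes row-major (C) ordering of the flattened data.
    """
    if not shape:
        raise ValueError("Shape must contain at least one dimension")
    if prod(shape) != len(values):
        raise ValueError("Shape does not match the number of values")
    if len(shape) == 1:
        return list(values)
    # Number of flat values covered by one step along the first dimension.
    step = prod(shape[1:])
    return [
        unflatten(values[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


shape = [2, 3]       # Shape attribute
rank = len(shape)    # Rank attribute: length of Shape
print(rank, unflatten([1, 2, 3, 4, 5, 6], shape))
```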
pdarray¶
Dataset (will have a user provided name. Defaults to ‘array’)
Attributes
ObjType: 1
isBool: 0 or 1
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - values of the pdarray.
Strings¶
Strings objects are stored within an HDF5 group. This group contains datasets storing the values and segments separately.
Group (user provided dataset name. Defaults to ‘strings_array’)
Attributes
ObjType: 2
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Dataset - Values (user provided dataset name with _values appended)
Attributes
ObjType: 1
isBool: 0 or 1
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - uint8 values representing the string values. Each string is terminated by a null byte.
Dataset - Offsets (user provided dataset name with _segments appended) (Optional)
Attributes
ObjType: 1
isBool: 0
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - int64 values representing the start index of each string value.
Please Note - The offsets dataset is not required but can be provided. Strings uses null byte termination and is able to calculate the offsets of its components during reads.
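Because each string ends with a null byte, the offsets can be recovered from the values alone. The sketch below shows one way to do that in plain Python. Both function names are hypothetical, and a `bytes` object stands in for the uint8 values dataset.

```python
def offsets_from_values(values):
    """Compute the start index of each null-terminated string in `values`."""
    offsets = []
    start = 0
    for i, byte in enumerate(values):
        if byte == 0:              # a null byte ends the current string
            offsets.append(start)
            start = i + 1
    return offsets


def decode_strings(values, offsets):
    """Slice the uint8 values at the given offsets and decode each string."""
    strings = []
    for k, begin in enumerate(offsets):
        end = offsets[k + 1] if k + 1 < len(offsets) else len(values)
        # end - 1 drops the trailing null byte of each string.
        strings.append(bytes(values[begin:end - 1]).decode("utf-8"))
    return strings


values = b"ab\x00\x00cde\x00"      # "ab", "" and "cde", each null terminated
offsets = offsets_from_values(values)
print(offsets, decode_strings(values, offsets))
```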
SegArray¶
SegArray objects are stored within an HDF5 group. This group contains datasets storing the values and segments separately.
Group (user provided dataset name. Defaults to ‘segarray’)
Attributes
ObjType: 3
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Dataset - Values
Attributes
ObjType: 1 or 2
isBool: 0 or 1
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - numeric values of the SegArray: int64, uint64, float64, or bool.
Dataset - Offsets
Attributes
ObjType: 1
isBool: 0
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - int64 values representing the start index of each segmented value.
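The relationship between the two datasets can be sketched as follows: each offset marks where one segment starts in the flat values, and the segment runs up to the next offset (or to the end of the values for the last segment). `segarray_to_lists` is a hypothetical helper working on plain Python lists.

```python
def segarray_to_lists(values, segments):
    """Split flat `values` into one list per segment using start offsets."""
    rows = []
    for k, begin in enumerate(segments):
        end = segments[k + 1] if k + 1 < len(segments) else len(values)
        rows.append(list(values[begin:end]))
    return rows


values = [1, 2, 3, 4, 5]
segments = [0, 2, 2]               # the second segment is empty
print(segarray_to_lists(values, segments))
```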
Categorical¶
Categorical objects are stored within an HDF5 group. This group contains datasets storing the components of the Categorical.
Group (user provided dataset name. Defaults to ‘categorical’)
Attributes
ObjType: 4
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Dataset - Codes
Attributes
ObjType: 1
isBool: 0
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - int64 values representing the codes of the Categorical.
Dataset - Categories
Attributes
ObjType: 2
isBool: 0
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - Strings group representing the categories of the Categorical.
Dataset - NA_Codes
Attributes
ObjType: 1
isBool: 0
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - int64 values representing the index of the categories with NA value.
Dataset - Permutation (Optional. Only include if Categorical object has permutation property)
Attributes
ObjType: 1
isBool: 0
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - int64 values representing the permutation of the categories.
Dataset - Segments (Optional. Only include if Categorical object has segments property)
Attributes
ObjType: 1
isBool: 0
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - int64 values representing the start index of category segments.
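As a rough illustration of how the required components fit together, the sketch below treats each code as an index into the categories and uses the NA codes to flag missing entries. The helper name and the sample categories (including the "N/A" label) are illustrative assumptions, and plain lists stand in for the HDF5 datasets.

```python
def decode_categorical(codes, categories, na_codes=()):
    """Map each code to its category and flag entries whose code is an NA code."""
    na = set(na_codes)
    values = [categories[c] for c in codes]
    is_na = [c in na for c in codes]
    return values, is_na


categories = ["low", "high", "N/A"]   # illustrative categories
codes = [0, 1, 1, 2, 0]
values, is_na = decode_categorical(codes, categories, na_codes=[2])
print(values)
print(is_na)
```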
GroupBy¶
GroupBy objects are stored within an HDF5 group. This group contains datasets storing the components of the GroupBy.
Group (user provided dataset name. Defaults to ‘groupby’)
Attributes
ObjType: 5
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Dataset - Permutation
Attributes
ObjType: 1
isBool: 0
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - int64 values representing the permutation of the GroupBy.
Dataset - Segments
Attributes
ObjType: 1
isBool: 0
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - int64 values representing the start index of GroupBy segments.
Dataset - unique_key_idx
Attributes
ObjType: 1
isBool: 0
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - int64 values representing the index of Unique keys in the GroupBy.
Dataset - KEY_# (multiple keys may be present. They will be numbered accordingly)
Attributes
ObjType: 1, 2, or 4 (pdarray, Strings, or Categorical)
isBool: 0 or 1
file_version: 2.0 (Optional)
arkouda_version: ‘current_arkouda_version’ (Optional)
Data - Key object used to generate the GroupBy. This will be a dataset or group depending on the object type.
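The sketch below shows one plausible relationship between these components for a single key array. It rests on assumptions that this section does not spell out: the permutation is taken to be a stable ordering that brings equal keys together, the segments are the start positions of each group within that ordering, and `unique_key_idx` is the original index of the first element of each group. Both helper names are hypothetical.

```python
def build_groupby(keys):
    """Derive permutation, segments and unique_key_idx for one key array."""
    # sorted() is stable, so equal keys keep their original relative order.
    permutation = sorted(range(len(keys)), key=lambda i: keys[i])
    # A new segment starts wherever the key changes in the permuted order.
    segments = [
        pos for pos, i in enumerate(permutation)
        if pos == 0 or keys[i] != keys[permutation[pos - 1]]
    ]
    unique_key_idx = [permutation[s] for s in segments]
    return permutation, segments, unique_key_idx


def group_members(permutation, segments):
    """Return the original indices belonging to each group."""
    members = []
    for k, begin in enumerate(segments):
        end = segments[k + 1] if k + 1 < len(segments) else len(permutation)
        members.append(permutation[begin:end])
    return members


keys = [3, 1, 3, 2, 1]
permutation, segments, unique_key_idx = build_groupby(keys)
print(permutation, segments, unique_key_idx)
print(group_members(permutation, segments))
```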
Supported Write Modes¶
Truncate
When writing to HDF5 in truncate mode, any existing HDF5 file with the same name will be overwritten. If no file exists, one will be created. If writing multiple objects, the first is written in truncate mode. All subsequent objects will then be appended to the file. The user will be notified of any overwritten files.
Append
When writing to HDF5 in append mode, all datasets will be appended to the file. If no file with the supplied name exists, one will be created. If any datasets being written have a name that is already the name of a dataset within the file, an error will be generated.
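The two modes can be modeled with a small in-memory sketch. Here a plain dict stands in for the contents of one HDF5 file, and `write_objects` is a hypothetical function used only to illustrate the rules described above. It does not reflect how Arkouda performs the write.

```python
def write_objects(store, objects, mode="truncate"):
    """Model writing named objects into `store`, a dict standing in for a file."""
    if mode not in ("truncate", "append"):
        raise ValueError(f"unsupported mode: {mode!r}")
    if mode == "truncate":
        # Existing contents are discarded; the objects are then added one by one.
        store.clear()
    for name, data in objects.items():
        if name in store:
            # Appending a dataset whose name already exists is an error.
            raise ValueError(f"dataset {name!r} already exists in the file")
        store[name] = data
    return store


file_model = {"a": [1, 2]}
write_objects(file_model, {"b": [3]}, mode="append")
print(sorted(file_model))
write_objects(file_model, {"c": [4]}, mode="truncate")
print(sorted(file_model))
```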
Data Distribution¶
Single File
If the user elects to write to a single HDF5 file, all data will be pulled to the processing node and saved to ONE file with the supplied file name. It is important to ensure that the object is small enough to prevent memory exhaustion on the node.
Distributed Files
If the user elects to write data to distributed files, data will be written to one file per locale. Each file will contain the data from the object local to the locale of that file. File names will be the name provided by the user with the suffix _LOCALE#### where #### will be replaced with the locale number. Because the data is distributed across multiple nodes, there is a much lower risk of memory exhaustion.
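The naming rule for distributed files can be sketched as below. The zero padding to four digits and the numbering of locales from 0 are assumptions read from the `####` placeholder, file extensions are not handled, and `locale_filenames` is a hypothetical helper.

```python
def locale_filenames(prefix, num_locales):
    """Build one file name per locale using the _LOCALE#### suffix."""
    # Assumes locale numbers start at 0 and are zero padded to four digits.
    return [f"{prefix}_LOCALE{i:04d}" for i in range(num_locales)]


print(locale_filenames("my_data", 3))
```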
Legacy File Support¶
Older versions of Arkouda used different schemas for pdarray and Strings objects (ArrayView was not supported). This format does not include the explicit ObjType attribute and requires the type to be inferred during processing. Reading these files is still supported by Arkouda. When the data type is uint8 and the object with the name dataset (user supplied dataset name) is a group containing a dataset named values, the object is assumed to be of object type Strings.
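The inference rule can be sketched with a small model in which dicts describe the HDF5 objects. The dict layout and the function name are illustrative assumptions. Treating any plain dataset as a pdarray is also an assumption, based on the statement that legacy schemas covered only pdarray and Strings.

```python
def infer_legacy_type(node):
    """Guess the object type of a legacy entry that has no ObjType attribute.

    `node` is a dict model of an HDF5 object:
      {"kind": "dataset", "dtype": "int64"}
      {"kind": "group", "members": {"values": {"kind": "dataset", "dtype": "uint8"}}}
    """
    if node["kind"] == "group":
        values = node.get("members", {}).get("values")
        # A group holding a uint8 dataset named "values" is read as Strings.
        if values is not None and values["dtype"] == "uint8":
            return "Strings"
        raise ValueError("unrecognized legacy group layout")
    # Assumption: any plain dataset in a legacy file is a pdarray.
    return "pdarray"


legacy_strings = {
    "kind": "group",
    "members": {"values": {"kind": "dataset", "dtype": "uint8"}},
}
print(infer_legacy_type(legacy_strings))
print(infer_legacy_type({"kind": "dataset", "dtype": "int64"}))
```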