The DataManager#

The DataManager is at the core of dantro: it stores data in a hierarchical way, thus forming the root of a data tree, and enables the loading of data into the tree.


Overview#

Essentially, the DataManager is a specialization of an OrderedDataGroup that is extended with data loading capabilities.

It is attached to a so-called “data directory”, which is the base directory from which data is loaded.
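To illustrate the role of the data directory: file lookups happen relative to it, much like in the following standard-library sketch (the file names are made up; this is not dantro's internal code):

```python
import pathlib
import tempfile

# Create a throwaway "data directory" with two YAML files in a subdirectory
data_dir = pathlib.Path(tempfile.mkdtemp())
(data_dir / "config").mkdir()
(data_dir / "config" / "defaults_cfg.yml").write_text("foo: 1\n")
(data_dir / "config" / "user_cfg.yml").write_text("bar: 2\n")

# A glob pattern (like the `glob_str` argument used when loading data)
# is resolved relative to the data directory:
matches = sorted(p.name for p in data_dir.glob("config/*.yml"))
print(matches)  # ['defaults_cfg.yml', 'user_cfg.yml']
```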

Data Loaders#

To provide certain loading capabilities to the DataManager, the data_loaders mixin classes can be used. To learn more about specializing the data manager to have the desired loading capabilities, see here.

By default, the following mixins are available via the AllAvailableLoadersMixin:

class AllAvailableLoadersMixin

Bases: dantro.data_loaders.text.TextLoaderMixin, dantro.data_loaders.yaml.YamlLoaderMixin, dantro.data_loaders.pickle.PickleLoaderMixin, dantro.data_loaders.hdf5.Hdf5LoaderMixin, dantro.data_loaders.xarray.XarrayLoaderMixin, dantro.data_loaders.pandas.PandasLoaderMixin, dantro.data_loaders.numpy.NumpyLoaderMixin

A mixin bundling all data loaders that are available in dantro. See the individual mixins for a more detailed documentation.

If you want all these loaders available in your data manager, inherit from this mixin class and DataManager:

import dantro

class MyDataManager(
    dantro.data_loaders.AllAvailableLoadersMixin,
    dantro.DataManager,
):
    pass

Load functions#

To see which data loading functions are available for a certain DataManager instance, you can use available_loaders. The output of that property can be used as the loader argument to the load() method, see Loading Data.

In [1]: import dantro

In [2]: class MyDataManager(dantro.data_loaders.AllAvailableLoadersMixin, dantro.DataManager):
   ...:     """My custom DataManager"""
   ...: 

# Instantiate it with some directory
In [3]: dm = MyDataManager("~")

# Print the names of all available data loaders, i.e. the `loader` argument
In [4]: print("\n".join(dm.available_loaders))
hdf5
hdf5_as_dask
hdf5_proxy
numpy
numpy_binary
numpy_txt
pandas_csv
pandas_generic
pickle
pkl
plain_text
text
xr_dataarray
xr_dataset
yaml
yaml_to_object
yml
yml_to_object

Hint

To learn about available arguments for these loaders, have a look at the API reference for the corresponding mixin, e.g. starting from the AllAvailableLoadersMixin.

Missing a loader?

No problem. You can easily specialize your data manager to include a custom loader.

Loading Data#

To load data into the data tree, there are two main methods:

  • The load() method loads a single so-called data entry.

  • The load_from_cfg() method loads multiple such entries; the cfg refers to a set of configuration entries.

For example, having specialized a data manager, data can be loaded in the following way:

import dantro
from dantro.data_loaders import YamlLoaderMixin

class MyDataManager(YamlLoaderMixin, dantro.DataManager):
    """A DataManager specialization that can load YAML data"""

dm = MyDataManager(data_dir=my_data_dir)

# Now, data can be loaded using the `load` command:
dm.load("some_data",       # where to load the data to
        loader="yaml",     # which loader to use
        glob_str="*.yml")  # which files to find and load

# Access it
dm["some_data"]
# ...

The Load Configuration#

A core concept of dantro is to make a lot of functionality available via configuration hierarchies, which can be well represented using YAML configuration files. This is also true for the DataManager, which can be initialized with a certain default load configuration, specifying multiple data entries to load.

When integrating dantro into your project, you will likely be in a situation where the structure of the data you are working with is known and more or less fixed. In such scenarios, it makes sense to pre-define which data you would like to load, how it should be loaded, and where it should be placed in the data tree.

This load configuration can be passed to the DataManager during initialization using the load_cfg argument, either as a path to a YAML file or as a dictionary. When then invoking load_from_cfg(), these default entries are loaded. Alternatively, load_from_cfg() also accepts a new load config or allows updating the default load config.
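For illustration, a default load configuration could look like the following (the entry names, loaders, and glob patterns are hypothetical and need to match your data):

```yaml
# my_load_cfg.yml -- passed to the DataManager via the `load_cfg` argument
my_config_files:
  loader: yaml
  glob_str: 'config/*.yml'
  required: true

measurement_data:
  loader: hdf5
  glob_str: 'measurements/*.hdf5'
```

Calling load_from_cfg() without arguments would then load both entries into the data tree.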

Example Load Configurations#

In the following, some advanced examples of specific load configurations are shown. These illustrate the various ways in which data can be loaded into the data tree. While most examples use only a single data entry, the entries can readily be combined into a common load configuration.

The basic setup for all the examples is as follows:

import dantro
from dantro.data_loaders import AllAvailableLoadersMixin

class MyDataManager(AllAvailableLoadersMixin, dantro.DataManager):
    """A DataManager specialization that can load various kinds of data"""

dm = MyDataManager(data_dir=my_data_dir, load_cfg=my_load_cfg)

The examples below are all structured in the following way:

  • First, they show the configuration that is passed as the my_load_cfg parameter, represented as YAML.

  • Then, they show the Python invocation of the load_from_cfg() method, including the resulting data tree.

  • Finally, they make a few remarks on what happened.

For specific information on argument syntax, refer to the docstring of the load() method.

Defining a target path within the data tree#

The target_path option allows more control over where data is loaded to.

my_config_files:
  loader: yaml
  glob_str: 'config/*.yml'
  required: true

  # Use information from the file name to generate the target path
  path_regex: config/(\w+)_cfg\.yml
  target_path: cfg/{match:}

dm.load_from_cfg(print_tree=True)
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
#  └─ cfg                         <OrderedDataGroup, 5 members, 0 attributes>
#     └┬ combined                 <MutableMappingContainer, 1 attribute>
#      ├ defaults                 <MutableMappingContainer, 1 attribute>
#      ├ machine                  <MutableMappingContainer, 1 attribute>
#      ├ update                   <MutableMappingContainer, 1 attribute>
#      └ user                     <MutableMappingContainer, 1 attribute>

Remarks:

  • With the required argument, an error is raised if glob_str matches no files.

  • With the path_regex argument, information from the path of the files can be used to generate a target_path within the tree, using the {match:} format string. In this example, this is used to drop the _cfg suffix, which would otherwise appear in the data tree.

  • With a target_path given, the name of the data entry (here: my_config_files) is decoupled from the position where the data is loaded to. Without that argument and the regex, the config files would have been loaded as my_config_files/combined_cfg, for example.

Hint

The regular expression in path_regex is not limited to a single match, it can also have multiple matching groups, which can be unnamed or named groups. See _prepare_target_path() for more options and examples.
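The underlying mechanics (matching a regular expression against the file path and feeding the match groups into the target_path format string) can be sketched using only the standard library; this mirrors the general idea, not dantro's exact implementation:

```python
import re

# A regex with one unnamed group, as in the example above
pattern = re.compile(r"config/(\w+)_cfg\.yml")
target_fstr = "cfg/{match:}"

m = pattern.match("config/defaults_cfg.yml")
target_path = target_fstr.format(match=m.group(1))
print(target_path)  # cfg/defaults

# Named groups work analogously, feeding keyword arguments into the
# format string
pattern = re.compile(r"measurements/day(?P<day>\d+)\.hdf5")
m = pattern.match("measurements/day042.hdf5")
day_path = "measurements/{day:}/data".format(**m.groupdict())
print(day_path)  # measurements/042/data
```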

Combining data entries#

The target_path option also allows combining data from different data entries, e.g. when they belong to the same measurement time:

# Load the (binary) measurement data for each day
measurement_data:
  loader: hdf5
  glob_str: measurements/day*.hdf5
  required: true
  path_regex: measurements/day(\d+)\.hdf5
  target_path: measurements/{match:}/data

# Load the parameter files, containing information about each day
measurement_parameters:
  loader: yaml
  glob_str: measurements/day*_params.yml
  required: true
  path_regex: measurements/day(\d+)_params\.yml
  target_path: measurements/{match:}/params

dm.load_from_cfg(print_tree="condensed")
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
#  └─ measurements                <OrderedDataGroup, 42 members, 0 attributes>
#     └┬ 000                      <OrderedDataGroup, 2 members, 0 attributes>
#        └┬ params                <MutableMappingContainer, 1 attribute>
#         └ data                  <OrderedDataGroup, 3 members, 0 attributes>
#           └┬ precipitation      <NumpyDataContainer, int64, shape (126,), 0 attributes>
#            ├ sensor_data        <OrderedDataGroup, 23 members, 1 attribute>
#              └┬ sensor000       <NumpyDataContainer, float64, shape (3, 89), 0 attributes>
#               ├ sensor001       <NumpyDataContainer, float64, shape (3, 85), 0 attributes>
#               ├ sensor002       <NumpyDataContainer, float64, shape (3, 94), 0 attributes>
#               ├ ...             ... (18 more) ...
#               ├ sensor021       <NumpyDataContainer, float64, shape (3, 80), 0 attributes>
#               └ sensor022       <NumpyDataContainer, float64, shape (3, 99), 0 attributes>
#            └ temperatures       <NumpyDataContainer, float64, shape (126,), 0 attributes>
#      ├ 001                      <OrderedDataGroup, 2 members, 0 attributes>
#        └┬ params                <MutableMappingContainer, 1 attribute>
#         └ data                  <OrderedDataGroup, 3 members, 0 attributes>
#           └┬ precipitation      <NumpyDataContainer, int64, shape (150,), 0 attributes>
#            ├ sensor_data        <OrderedDataGroup, 23 members, 1 attribute>
#              └┬ sensor000       <NumpyDataContainer, float64, shape (3, 99), 0 attributes>
#               ├ sensor001       <NumpyDataContainer, float64, shape (3, 85), 0 attributes>
#               ├ ...

Loading data as container attributes#

In some scenarios, it is desirable to load some data not as a regular entry into the data tree, but as a container attribute. Continuing with the example from above, we might want to load the parameters directly into the container for each day.

# Load the (binary) measurement data for each day
measurement_data:
  loader: hdf5
  glob_str: measurements/day*.hdf5
  required: true
  path_regex: measurements/day(\d+)\.hdf5
  target_path: measurements/{match:}

# Load the parameter files as container attributes
params:
  loader: yaml
  glob_str: measurements/day*_params.yml
  required: true
  load_as_attr: true
  unpack_data: true
  path_regex: measurements/day(\d+)_params\.yml
  target_path: measurements/{match:}

dm.load_from_cfg(print_tree="condensed")
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
#  └─ measurements                <OrderedDataGroup, 42 members, 0 attributes>
#     └┬ 000                      <OrderedDataGroup, 3 members, 1 attribute>
#        └┬ precipitation         <NumpyDataContainer, int64, shape (165,), 0 attributes>
#         ├ sensor_data           <OrderedDataGroup, 23 members, 1 attribute>
#           └┬ sensor000          <NumpyDataContainer, float64, shape (3, 92), 0 attributes>
#            ├ sensor001          <NumpyDataContainer, float64, shape (3, 91), 0 attributes>
#            ├ sensor002          <NumpyDataContainer, float64, shape (3, 93), 0 attributes>
#            ├ ...                ... (18 more) ...
#            ├ sensor021          <NumpyDataContainer, float64, shape (3, 83), 0 attributes>
#            └ sensor022          <NumpyDataContainer, float64, shape (3, 97), 0 attributes>
#         └ temperatures          <NumpyDataContainer, float64, shape (165,), 0 attributes>
#      ├ 001                      <OrderedDataGroup, 3 members, 1 attribute>
#        └┬ precipitation         <NumpyDataContainer, int64, shape (181,), 0 attributes>
#         ├ sensor_data           <OrderedDataGroup, 23 members, 1 attribute>
#           └┬ sensor000          <NumpyDataContainer, float64, shape (3, 84), 0 attributes>
#            ├ sensor001          <NumpyDataContainer, float64, shape (3, 85), 0 attributes>
#            ├ ...

# Check attribute access to the parameters
for cont_name, data in dm["measurements"].items():
    params = data.attrs["params"]
    assert params["day"] == int(cont_name)

Note the 000 group showing one more attribute than in previous examples; this is the params attribute.

Remarks:

  • By using load_as_attr, the measurement parameters are made available as a container attribute and become accessible via the container's attrs property. (This is not to be confused with regular Python object attributes.)

  • When using load_as_attr, the entry name is used as the attribute name.

  • The unpack_data option stores the loaded object as a dictionary rather than a MutableMappingContainer, removing one level of indirection.

Prescribing tree structure and specializations#

Sometimes, load configurations are easier to handle when an empty tree structure is created prior to loading. This can be done using the DataManager's create_groups argument, which also allows specifying custom group classes, e.g. to denote a time series.

# Load the (binary) measurement data for each day
measurement_data:
  loader: hdf5
  glob_str: measurements/day*.hdf5
  required: true
  path_regex: measurements/day(\d+)\.hdf5
  target_path: measurements/{match:}

from dantro.groups import TimeSeriesGroup

dm = MyDataManager(data_dir=my_data_dir, out_dir=False,
                   load_cfg=my_load_cfg,
                   create_groups=[dict(path="measurements",
                                       Cls=TimeSeriesGroup)])

dm.load_from_cfg(print_tree="condensed")
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
#  └─ measurements                <TimeSeriesGroup, 42 members, 0 attributes>
#     └┬ 000                      <OrderedDataGroup, 3 members, 0 attributes>
#        └┬ precipitation         <NumpyDataContainer, int64, shape (165,), 0 attributes>
#         ├ sensor_data           <OrderedDataGroup, 23 members, 1 attribute>
#           └┬ sensor000          <NumpyDataContainer, float64, shape (3, 92), 0 attributes>
#            ├ sensor001          <NumpyDataContainer, float64, shape (3, 91), 0 attributes>
#            ├ sensor002          <NumpyDataContainer, float64, shape (3, 93), 0 attributes>
#            ├ ...

Remarks:

  • Multiple paths can be specified in create_groups.

  • Paths can also have multiple segments, like my/custom/group/path.

  • The dm['measurements'] entry is now a TimeSeriesGroup, and thus represents one dimension of the stored data, e.g. the precipitation data.

Loading data as proxy#

Sometimes, data is too large to be loaded into memory completely. For example, if we are only interested in the precipitation data, the sensor data should not be loaded into memory.

dantro provides a mechanism to build the data tree using placeholder objects, so-called proxies. The following example illustrates this and additionally uses the dask framework to allow delayed computations.
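The basic proxy idea can be sketched in plain Python (the class and names here are hypothetical, not dantro's actual proxy machinery): a placeholder stores how to retrieve the data and only does so on first access.

```python
class DataProxy:
    """Placeholder that knows *how* to load data, but defers doing so."""

    def __init__(self, load_func):
        self._load_func = load_func
        self._data = None

    def resolve(self):
        # Invoke the load function only once, on first access
        if self._data is None:
            print("resolving proxy ...")
            self._data = self._load_func()
        return self._data

# In dantro, the load function would e.g. read an HDF5 dataset; for this
# sketch, it simply returns a list
proxy = DataProxy(lambda: [1, 2, 3])

# No data has been loaded yet; only upon resolution is the loader invoked
data = proxy.resolve()
print(sum(data))  # 6
```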

# Load the (binary) measurement data for each day
measurement_data:
  loader: hdf5
  glob_str: measurements/day*.hdf5
  required: true
  path_regex: measurements/day(\d+)\.hdf5
  target_path: measurements/{match:}
  load_as_proxy: true
  proxy_kwargs:
    resolve_as_dask: true

from dantro.containers import XrDataContainer
from dantro.mixins import Hdf5ProxySupportMixin

class MyXrDataContainer(Hdf5ProxySupportMixin, XrDataContainer):
    """An xarray data container that allows proxy data"""

class MyDataManager(AllAvailableLoadersMixin, dantro.DataManager):
    """A DataManager specialization that can load various kinds of data
    and uses containers that supply proxy support
    """
    # Configure the HDF5 loader to use the custom xarray container
    _HDF5_DSET_DEFAULT_CLS = MyXrDataContainer

dm = MyDataManager(data_dir=my_data_dir, out_dir=False,
                   load_cfg=my_load_cfg)
dm.load_from_cfg(print_tree="condensed")
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
#  └─ measurements                <OrderedDataGroup, 42 members, 0 attributes>
#     └┬ 000                      <OrderedDataGroup, 3 members, 0 attributes>
#        └┬ precipitation         <MyXrDataContainer, proxy (hdf5, dask), int64, shape (165,), 0 attributes>
#         ├ sensor_data           <OrderedDataGroup, 23 members, 1 attribute>
#           └┬ sensor000          <MyXrDataContainer, proxy (hdf5, dask), float64, shape (3, 92), 0 attributes>
#            ├ sensor001          <MyXrDataContainer, proxy (hdf5, dask), float64, shape (3, 91), 0 attributes>
#            ├ sensor002          <MyXrDataContainer, proxy (hdf5, dask), float64, shape (3, 93), 0 attributes>
#            ├ ...

# Work with the data in the same way as before; it's loaded on the fly
total_precipitation = 0.
for day_data in dm["measurements"].values():
    total_precipitation += day_data["precipitation"].sum()

Remarks:

  • For details about loading large data using proxies and dask, see Handling Large Amounts of Data.