The DataManager#

The DataManager is at the core of dantro: it stores data in a hierarchical way, thus forming the root of a data tree, and enables the loading of data into the tree.


Overview#

Essentially, the DataManager is a specialization of an OrderedDataGroup, extended with data-loading capabilities.

It is attached to a so-called “data directory”, which is the base directory from which data can be loaded.

Data Loaders#

To equip the DataManager with loading capabilities, the data_loaders mixin classes can be used. To learn more about specializing the data manager accordingly, see here.

By default, the following mixins are available via the AllAvailableLoadersMixin:

class AllAvailableLoadersMixin

Bases: dantro.data_loaders.text.TextLoaderMixin, dantro.data_loaders.yaml.YamlLoaderMixin, dantro.data_loaders.pickle.PickleLoaderMixin, dantro.data_loaders.hdf5.Hdf5LoaderMixin, dantro.data_loaders.xarray.XarrayLoaderMixin, dantro.data_loaders.numpy.NumpyLoaderMixin

A mixin bundling all data loaders that are available in dantro.

This is useful for a more convenient import in a downstream DataManager.

See the individual mixins for more detailed documentation.

Loading Data#

To load data into the data tree, there are two main methods:

  • The load() method loads a single so-called data entry.

  • The load_from_cfg() method loads multiple such entries; cfg refers to a configuration specifying a set of entries to load.

For example, after specializing a data manager, data can be loaded in the following way:

import dantro
from dantro.data_loaders import YamlLoaderMixin

class MyDataManager(YamlLoaderMixin, dantro.DataManager):
    """A DataManager specialization that can load YAML data"""

# my_data_dir is a placeholder for the path to your data directory
dm = MyDataManager(data_dir=my_data_dir)

# Now, data can be loaded using the `load` command:
dm.load("some_data",       # where to load the data to
        loader="yaml",     # which loader to use
        glob_str="*.yml")  # which files to find and load

# Access it
dm['some_data']
# ...

The Load Configuration#

A core concept of dantro is to make a lot of functionality available via configuration hierarchies, which can be conveniently represented as YAML configuration files. This also holds for the DataManager, which can be initialized with a default load configuration that specifies multiple data entries to load.

When integrating dantro into your project, you will likely be in a situation where the structure of the data you are working with is known and more or less fixed. In such scenarios, it makes sense to pre-define which data you would like to load, how it should be loaded, and where it should be placed in the data tree.

This load configuration can be passed to the DataManager during initialization via the load_cfg argument, either as a path to a YAML file or as a dictionary. Invoking load_from_cfg() then loads these default entries. Alternatively, load_from_cfg() accepts a new load configuration or updates to the default one.
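
As a sketch, a minimal default load configuration could look like the following; the entry name and file pattern are made up for illustration, and MyDataManager is the YAML-capable specialization from above:

# A hypothetical default load configuration, given as a dictionary
my_load_cfg = dict(
    my_config_files=dict(
        loader="yaml",
        glob_str="config/*.yml",
    ),
)

dm = MyDataManager(data_dir=my_data_dir, load_cfg=my_load_cfg)
dm.load_from_cfg(print_tree=True)  # loads the default entries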

Example Load Configurations#

In the following, some advanced examples of specific load configurations are shown. They illustrate the various ways in which data can be loaded into the data tree. While most examples use only a single data entry, such entries can be readily combined into a common load configuration.

The basic setup for all the examples is as follows:

import dantro
from dantro.data_loaders import AllAvailableLoadersMixin

class MyDataManager(AllAvailableLoadersMixin, dantro.DataManager):
    """A DataManager specialization that can load various kinds of data"""

dm = MyDataManager(data_dir=my_data_dir, load_cfg=my_load_cfg)

The examples below are all structured in the following way:

  • First, they show the configuration that is passed as the my_load_cfg parameter, represented as YAML.

  • Then, they show the Python invocation of the load_from_cfg() method, including the resulting data tree.

  • Finally, they make a few remarks on what happened.

For specific information on argument syntax, refer to the docstring of the load() method.

Defining a target path within the data tree#

The target_path option allows more control over where data is loaded to.

my_config_files:
  loader: yaml
  glob_str: 'config/*.yml'
  required: true

  # Use information from the file name to generate the target path
  path_regex: config/(\w+)_cfg.yml
  target_path: cfg/{match:}

dm.load_from_cfg(print_tree=True)
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
#  └─ cfg                         <OrderedDataGroup, 5 members, 0 attributes>
#     └┬ combined                 <MutableMappingContainer, 1 attribute>
#      ├ defaults                 <MutableMappingContainer, 1 attribute>
#      ├ machine                  <MutableMappingContainer, 1 attribute>
#      ├ update                   <MutableMappingContainer, 1 attribute>
#      └ user                     <MutableMappingContainer, 1 attribute>

Remarks:

  • With the required argument, an error is raised when no files were matched by glob_str.

  • With the path_regex argument, information from the paths of the matched files can be used to generate a target_path within the tree, using the {match:} format string. In this example, this is used to drop the _cfg suffix, which would otherwise appear in the data tree. The regular expression is currently limited to a single match; the sketch below illustrates how the match is applied.

  • With a target_path given, the name of the data entry (here: my_config_files) is decoupled from the position where the data is loaded to. Without that argument and the regex, the config files would have been loaded as my_config_files/combined_cfg, for example.
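
To make the effect of path_regex and {match:} more tangible, here is a small sketch using Python's re module; dantro performs this matching internally and may apply it slightly differently, so this is only an approximation:

import re

# The regex from the load configuration above, applied to one file path
m = re.match(r"config/(\w+)_cfg.yml", "config/machine_cfg.yml")
print(m.group(1))  # -> "machine"

# The match is inserted into the target_path format string
print("cfg/{match:}".format(match=m.group(1)))  # -> "cfg/machine"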

Combining data entries#

The target_path option also allows combining data from different data entries, e.g. when they belong to the same measurement day:

# Load the (binary) measurement data for each day
measurement_data:
  loader: hdf5
  glob_str: measurements/day*.hdf5
  required: true
  path_regex: measurements/day(\d+).hdf5
  target_path: measurements/{match:}/data

# Load the parameter files, containing information about each day
measurement_parameters:
  loader: yaml
  glob_str: measurements/day*_params.yml
  required: true
  path_regex: measurements/day(\d+)_params.yml
  target_path: measurements/{match:}/params

dm.load_from_cfg(print_tree='condensed')
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
#  └─ measurements                <OrderedDataGroup, 42 members, 0 attributes>
#     └┬ 000                      <OrderedDataGroup, 2 members, 0 attributes>
#        └┬ params                <MutableMappingContainer, 1 attribute>
#         └ data                  <OrderedDataGroup, 3 members, 0 attributes>
#           └┬ precipitation      <NumpyDataContainer, int64, shape (126,), 0 attributes>
#            ├ sensor_data        <OrderedDataGroup, 23 members, 1 attribute>
#              └┬ sensor000       <NumpyDataContainer, float64, shape (3, 89), 0 attributes>
#               ├ sensor001       <NumpyDataContainer, float64, shape (3, 85), 0 attributes>
#               ├ sensor002       <NumpyDataContainer, float64, shape (3, 94), 0 attributes>
#               ├ ...             ... (18 more) ...
#               ├ sensor021       <NumpyDataContainer, float64, shape (3, 80), 0 attributes>
#               └ sensor022       <NumpyDataContainer, float64, shape (3, 99), 0 attributes>
#            └ temperatures       <NumpyDataContainer, float64, shape (126,), 0 attributes>
#      ├ 001                      <OrderedDataGroup, 2 members, 0 attributes>
#        └┬ params                <MutableMappingContainer, 1 attribute>
#         └ data                  <OrderedDataGroup, 3 members, 0 attributes>
#           └┬ precipitation      <NumpyDataContainer, int64, shape (150,), 0 attributes>
#            ├ sensor_data        <OrderedDataGroup, 23 members, 1 attribute>
#              └┬ sensor000       <NumpyDataContainer, float64, shape (3, 99), 0 attributes>
#               ├ sensor001       <NumpyDataContainer, float64, shape (3, 85), 0 attributes>
#               ├ ...
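
With both entries loaded into a common group per day, the data and its parameters can now be accessed side by side; a brief sketch based on the tree shown above:

# Access the data and parameters of one measurement day together
day = dm['measurements/000']
params = day['params']
precipitation = day['data/precipitation']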

Loading data as container attributes#

In some scenarios, it is desirable to load some data not as a regular entry into the data tree, but as a container attribute. Continuing with the example from above, we might want to load the parameters directly as an attribute of each day's group.

# Load the (binary) measurement data for each day
measurement_data:
  loader: hdf5
  glob_str: measurements/day*.hdf5
  required: true
  path_regex: measurements/day(\d+).hdf5
  target_path: measurements/{match:}

# Load the parameter files as container attributes
params:
  loader: yaml
  glob_str: measurements/day*_params.yml
  required: true
  load_as_attr: true
  unpack_data: true
  path_regex: measurements/day(\d+)_params.yml
  target_path: measurements/{match:}

dm.load_from_cfg(print_tree='condensed')
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
#  └─ measurements                <OrderedDataGroup, 42 members, 0 attributes>
#     └┬ 000                      <OrderedDataGroup, 3 members, 1 attribute>
#        └┬ precipitation         <NumpyDataContainer, int64, shape (165,), 0 attributes>
#         ├ sensor_data           <OrderedDataGroup, 23 members, 1 attribute>
#           └┬ sensor000          <NumpyDataContainer, float64, shape (3, 92), 0 attributes>
#            ├ sensor001          <NumpyDataContainer, float64, shape (3, 91), 0 attributes>
#            ├ sensor002          <NumpyDataContainer, float64, shape (3, 93), 0 attributes>
#            ├ ...                ... (18 more) ...
#            ├ sensor021          <NumpyDataContainer, float64, shape (3, 83), 0 attributes>
#            └ sensor022          <NumpyDataContainer, float64, shape (3, 97), 0 attributes>
#         └ temperatures          <NumpyDataContainer, float64, shape (165,), 0 attributes>
#      ├ 001                      <OrderedDataGroup, 3 members, 1 attribute>
#        └┬ precipitation         <NumpyDataContainer, int64, shape (181,), 0 attributes>
#         ├ sensor_data           <OrderedDataGroup, 23 members, 1 attribute>
#           └┬ sensor000          <NumpyDataContainer, float64, shape (3, 84), 0 attributes>
#            ├ sensor001          <NumpyDataContainer, float64, shape (3, 85), 0 attributes>
#            ├ ...

# Check attribute access to the parameters
for cont_name, data in dm['measurements'].items():
    params = data.attrs['params']
    assert params['day'] == int(cont_name)

Note that the 000 group now shows one more attribute than in the previous examples: this is the params attribute.

Remarks:

  • By using load_as_attr, the measurement parameters are made available as a container attribute and become accessible via the container's attrs property. (This is not to be confused with regular Python object attributes.)

  • When using load_as_attr, the entry name is used as the attribute name.

  • The unpack_data option makes the stored object a plain dictionary rather than a MutableMappingContainer, removing one level of indirection; see the sketch below.
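
For example, the effect of unpack_data can be verified directly; a minimal sketch, assuming the tree from above:

# With unpack_data: true, the stored attribute is a plain dict
params = dm['measurements/000'].attrs['params']
assert isinstance(params, dict)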

Prescribing tree structure and specializations#

Sometimes, load configurations become easier to handle when an empty tree structure is created prior to loading. This can be done using the DataManager's create_groups argument, which also allows specifying custom group classes, e.g. to denote a time series.

# Load the (binary) measurement data for each day
measurement_data:
  loader: hdf5
  glob_str: measurements/day*.hdf5
  required: true
  path_regex: measurements/day(\d+).hdf5
  target_path: measurements/{match:}

from dantro.groups import TimeSeriesGroup

dm = MyDataManager(data_dir=my_data_dir, out_dir=False,
                   load_cfg=my_load_cfg,
                   create_groups=[dict(path='measurements',
                                       Cls=TimeSeriesGroup)])

dm.load_from_cfg(print_tree='condensed')
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
#  └─ measurements                <TimeSeriesGroup, 42 members, 0 attributes>
#     └┬ 000                      <OrderedDataGroup, 3 members, 0 attributes>
#        └┬ precipitation         <NumpyDataContainer, int64, shape (165,), 0 attributes>
#         ├ sensor_data           <OrderedDataGroup, 23 members, 1 attribute>
#           └┬ sensor000          <NumpyDataContainer, float64, shape (3, 92), 0 attributes>
#            ├ sensor001          <NumpyDataContainer, float64, shape (3, 91), 0 attributes>
#            ├ sensor002          <NumpyDataContainer, float64, shape (3, 93), 0 attributes>
#            ├ ...

Remarks:

  • Multiple paths can be specified in create_groups.

  • Paths can also have multiple segments, like my/custom/group/path; the sketch below combines both options.

  • The dm['measurements'] entry is now a TimeSeriesGroup and thus represents one dimension of the stored data, here the time dimension of, e.g., the precipitation data.
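
For instance, several groups can be prescribed at once; a sketch under the assumption that plain string paths are accepted alongside dicts:

# A sketch: prescribe a multi-segment path and a custom group class
# (passing plain strings alongside dicts is an assumption here)
dm = MyDataManager(
    data_dir=my_data_dir, out_dir=False, load_cfg=my_load_cfg,
    create_groups=[
        "my/custom/group/path",
        dict(path="measurements", Cls=TimeSeriesGroup),
    ],
)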

Loading data as proxy#

Sometimes, data is too large to be loaded into memory completely. For example, if we are only interested in the precipitation data, the sensor data need not be loaded into memory.

Dantro provides a mechanism to build the data tree using placeholder objects, so-called proxies. The following example illustrates this and additionally uses the dask framework to allow delayed computations.

# Load the (binary) measurement data for each day
measurement_data:
  loader: hdf5
  glob_str: measurements/day*.hdf5
  required: true
  path_regex: measurements/day(\d+).hdf5
  target_path: measurements/{match:}
  load_as_proxy: true
  proxy_kwargs:
    resolve_as_dask: true

from dantro.containers import XrDataContainer
from dantro.mixins import Hdf5ProxySupportMixin

class MyXrDataContainer(Hdf5ProxySupportMixin, XrDataContainer):
    """An xarray data container that allows proxy data"""

class MyDataManager(AllAvailableLoadersMixin, dantro.DataManager):
    """A DataManager specialization that can load various kinds of data
    and uses containers that supply proxy support
    """
    # Configure the HDF5 loader to use the custom xarray container
    _HDF5_DSET_DEFAULT_CLS = MyXrDataContainer

dm = MyDataManager(data_dir=my_data_dir, out_dir=False,
                   load_cfg=my_load_cfg)
dm.load_from_cfg(print_tree='condensed')
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
#  └─ measurements                <OrderedDataGroup, 42 members, 0 attributes>
#     └┬ 000                      <OrderedDataGroup, 3 members, 0 attributes>
#        └┬ precipitation         <MyXrDataContainer, proxy (hdf5, dask), int64, shape (165,), 0 attributes>
#         ├ sensor_data           <OrderedDataGroup, 23 members, 1 attribute>
#           └┬ sensor000          <MyXrDataContainer, proxy (hdf5, dask), float64, shape (3, 92), 0 attributes>
#            ├ sensor001          <MyXrDataContainer, proxy (hdf5, dask), float64, shape (3, 91), 0 attributes>
#            ├ sensor002          <MyXrDataContainer, proxy (hdf5, dask), float64, shape (3, 93), 0 attributes>
#            ├ ...

# Work with the data in the same way as before; it's loaded on the fly
total_precipitation = 0.
for day_data in dm['measurements'].values():
    total_precipitation += day_data['precipitation'].sum()
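
# Note: with resolve_as_dask, the accumulated result may itself be a lazy
# dask object (an assumption based on standard dask semantics, not stated
# in this example); in that case, materialize it explicitly, e.g. via:
#
#     total_precipitation = total_precipitation.compute()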

Remarks:

  • For details about loading large data using proxies and dask, see Handling Large Amounts of Data.