The DataManager

The DataManager is at the core of dantro: it stores data in a hierarchical way, thus forming the root of a data tree, and enables the loading of data into the tree.


Overview

Essentially, the DataManager is a specialization of an OrderedDataGroup that is extended with data loading capabilities.

It is attached to a data directory, which serves as the directory that data is loaded from.

Data Loaders

To give the DataManager specific loading capabilities, the mixin classes from data_loaders can be used.

class dantro.data_loaders.AllAvailableLoadersMixin

Bases: dantro.data_loaders.load_yaml.YamlLoaderMixin, dantro.data_loaders.load_pkl.PickleLoaderMixin, dantro.data_loaders.load_hdf5.Hdf5LoaderMixin, dantro.data_loaders.load_xarray.XarrayLoaderMixin, dantro.data_loaders.load_numpy.NumpyLoaderMixin

A mixin bundling all available data loaders.

This is useful for a more convenient import in a downstream DataManager.

To learn more about the specialization, see the dantro documentation on specializing the DataManager.
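
The mixin pattern itself can be illustrated without dantro. The following is a toy sketch, not dantro's implementation: the classes ToyManager, JsonLoaderMixin, TextLoaderMixin and AllLoadersMixin, the load_file method, and the `_load_<name>` lookup convention are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


class JsonLoaderMixin:
    """Hypothetical mixin contributing a 'json' loader."""

    def _load_json(self, path: Path):
        return json.loads(path.read_text())


class TextLoaderMixin:
    """Hypothetical mixin contributing a 'text' loader."""

    def _load_text(self, path: Path):
        return path.read_text()


class ToyManager:
    """Stand-in base class (NOT dantro's DataManager); has no loaders itself."""

    def load_file(self, path: Path, *, loader: str):
        # Look up the loader method that a mixin may have contributed
        load_func = getattr(self, f"_load_{loader}", None)
        if load_func is None:
            raise ValueError(f"No loader named '{loader}' available")
        return load_func(path)


class AllLoadersMixin(JsonLoaderMixin, TextLoaderMixin):
    """Bundles the toy loaders, similar in spirit to AllAvailableLoadersMixin."""


class MyToyManager(AllLoadersMixin, ToyManager):
    """A specialization that gains all bundled loaders via inheritance."""


with tempfile.TemporaryDirectory() as tmp:
    example_file = Path(tmp) / "params.json"
    example_file.write_text('{"answer": 42}')

    mgr = MyToyManager()
    loaded_json = mgr.load_file(example_file, loader="json")
    loaded_text = mgr.load_file(example_file, loader="text")
```

Bundling the loaders in a single mixin keeps the class definition of the downstream manager short, which is the convenience that AllAvailableLoadersMixin is described as providing.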

Loading Data

To load data into the data tree, there are two methods:

  • The load() method loads a single so-called data entry.
  • The load_from_cfg() method loads multiple such entries; the cfg refers to a set of configuration entries.

For example, having specialized a data manager, data can be loaded in the following way:

dm = MyDataManager(data_dir="~/my_data")

# Now, data can be loaded using the `load` method:
dm.load("some_data",       # where to load the data to
        loader="yaml",     # which loader to use
        glob_str="*.yml")  # which files to find and load

# Access it
dm['some_data']
# ...
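
To make the role of glob_str more concrete, here is a minimal sketch of how a glob pattern selects files below a data directory. It assumes that the pattern is interpreted relative to the data directory; the helper find_files is hypothetical and not part of dantro.

```python
import tempfile
from pathlib import Path


def find_files(data_dir: str, glob_str: str) -> list:
    """Toy helper: return the sorted relative paths below data_dir
    that match glob_str."""
    root = Path(data_dir).expanduser()  # also resolves a leading "~"
    return sorted(p.relative_to(root).as_posix() for p in root.glob(glob_str))


with tempfile.TemporaryDirectory() as tmp:
    # Create two YAML files and one file that should not be matched
    for name in ("a.yml", "b.yml", "notes.txt"):
        (Path(tmp) / name).write_text("key: value\n")

    found = find_files(tmp, "*.yml")
```

In this sketch, only a.yml and b.yml are selected; notes.txt does not match the pattern and is ignored.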

The Load Configuration

A core concept of dantro is to make as much functionality as possible available via YAML-based configuration files. This also holds for the DataManager: it can be initialized with a load configuration that specifies the data entries to load.

For a known structure of the output data, it makes sense to pre-define the configuration somewhere and use that configuration to load all required data. This configuration can be passed to the DataManager during initialization using the load_cfg argument.

A rather complex example of a load configuration comes from the Utopia project:

# Supply a default load configuration for the DataManager
load_cfg:
  # Load the frontend configuration files from the config/ directory
  # Each file refers to a level of the configuration that is supplied to
  # the Multiverse: base <- user <- model <- run <- update
  cfg:
    loader: yaml
    glob_str: 'config/*.yml'
    required: true
    path_regex: config/(\w+)_cfg.yml
    target_path: cfg/{match:}

  # Load the configuration files that are generated for _each_ simulation
  # These hold all information that is available to a single simulation and
  # are in an explicit, human-readable form.
  uni_cfg:
    loader: yaml
    glob_str: universes/uni*/config.yml
    required: true
    path_regex: universes/uni(\d+)/config.yml
    target_path: uni/{match:}/cfg

  # Load the binary output data from each simulation.
  data:
    loader: hdf5_proxy
    glob_str: universes/uni*/data.h5
    required: true
    path_regex: universes/uni(\d+)/data.h5
    target_path: uni/{match:}/data
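
The interplay of path_regex and target_path in the configuration above can be sketched in a few lines of Python. This is an illustration under the assumption that the first capture group of path_regex is substituted for the {match:} placeholder in target_path; the helper resolve_target and the example file paths are hypothetical and not taken from dantro or Utopia.

```python
import re


def resolve_target(file_path: str, *, path_regex: str, target_path: str) -> str:
    """Toy helper: derive the location in the data tree for a given file."""
    match = re.search(path_regex, file_path)
    if match is None:
        raise ValueError(f"'{file_path}' does not match '{path_regex}'")
    # The first capture group fills the {match:} placeholder
    return target_path.format(match=match.group(1))


cfg_target = resolve_target(
    "config/base_cfg.yml",
    path_regex=r"config/(\w+)_cfg.yml",
    target_path="cfg/{match:}",
)
data_target = resolve_target(
    "universes/uni0042/data.h5",
    path_regex=r"universes/uni(\d+)/data.h5",
    target_path="uni/{match:}/data",
)
```

Under these assumptions, cfg_target is "cfg/base" and data_target is "uni/0042/data": each file ends up at a tree location derived from its own path.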

Once the DataManager is configured this way, it becomes very easy to load all configured data entries via load_from_cfg():

dm = MyDataManager(data_dir="~/my_data", load_cfg=load_cfg_dict)
dm.load_from_cfg()

The resulting data tree can then be accessed in the following way:

# Access the data
meta_cfg = dm['cfg/meta']
some_param = meta_cfg['some']['parameter']

# Do something with the universes
for uni_name, uni in dm['uni'].items():
    print("Current universe: ", uni_name)
    do_something_with(data=uni['data'], cfg=uni['cfg'])
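
The path-style item access used above (for example dm['cfg/meta']) can be mimicked with a small nested mapping. The following is a toy stand-in, not dantro's group implementation: the class ToyGroup and the example tree contents are assumptions made for illustration.

```python
class ToyGroup(dict):
    """Toy stand-in for a data group: a dict that resolves 'a/b/c' paths."""

    def __getitem__(self, key: str):
        # Split off the first path segment; the remainder is resolved
        # by the item found under that segment
        head, _, rest = key.partition("/")
        item = super().__getitem__(head)
        return item[rest] if rest else item


# Hypothetical tree, shaped like the result of the load configuration
tree = ToyGroup(
    cfg=ToyGroup(meta={"some": {"parameter": 1.5}}),
    uni=ToyGroup({
        "0": ToyGroup(cfg={"seed": 0}, data="<data of universe 0>"),
        "1": ToyGroup(cfg={"seed": 1}, data="<data of universe 1>"),
    }),
)

# Path-style access, analogous to the snippet above
meta_cfg = tree["cfg/meta"]
some_param = meta_cfg["some"]["parameter"]

# Iterate over the universes and collect their names
uni_names = [uni_name for uni_name, uni in tree["uni"].items()]
```

Resolving one path segment per group keeps each group responsible only for its direct children, which is what makes the tree hierarchical.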