The DataManager#
The DataManager is at the core of dantro: it stores data in a hierarchical way, thus forming the root of a data tree, and enables the loading of data into the tree.
Overview#
Essentially, the DataManager is a specialization of an OrderedDataGroup that is extended with data loading capabilities.
It is attached to a so-called “data directory”, which is the base directory from which data can be loaded.
Data Loaders#
To provide certain loading capabilities to the DataManager, the data_loaders mixin classes can be used.
To learn more about specializing the data manager to have the desired loading capabilities, see here.
By default, the following mixins are available via the AllAvailableLoadersMixin:
class AllAvailableLoadersMixin
Bases: dantro.data_loaders.text.TextLoaderMixin, dantro.data_loaders.fspath.FSPathLoaderMixin, dantro.data_loaders.yaml.YamlLoaderMixin, dantro.data_loaders.pickle.PickleLoaderMixin, dantro.data_loaders.hdf5.Hdf5LoaderMixin, dantro.data_loaders.xarray.XarrayLoaderMixin, dantro.data_loaders.pandas.PandasLoaderMixin, dantro.data_loaders.numpy.NumpyLoaderMixin

A mixin bundling all data loaders that are available in dantro. See the individual mixins for more detailed documentation.

If you want all these loaders available in your data manager, inherit from this mixin class and DataManager:

import dantro

class MyDataManager(
    dantro.data_loaders.AllAvailableLoadersMixin,
    dantro.DataManager,
):
    pass
All these are also available via the DATA_LOADERS registry.
The “vanilla” DataManager can access all these loaders directly, even without mixins.
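For instance, loading YAML files with the plain DataManager might look like this (a minimal sketch; the data directory path is a placeholder):

import dantro

# Hypothetical data directory containing YAML files
dm = dantro.DataManager("~/my_data")
dm.load("cfg", loader="yaml", glob_str="*.yml")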
Load functions#
To see which data loading functions are available for a certain DataManager instance, you can use the available_loaders property.
The output of that property can be used as the loader argument to the load() method; see Loading Data.
In [1]: import dantro
In [2]: class MyDataManager(dantro.data_loaders.AllAvailableLoadersMixin, dantro.DataManager):
   ...:     """My custom DataManager"""
   ...:
# Instantiate it with some directory
In [3]: dm = MyDataManager("~")
# Print the names of all available data loaders, i.e. the `loader` argument
In [4]: print("\n".join(dm.available_loaders))
fspath
fstree
hdf5
hdf5_as_dask
hdf5_proxy
numpy
numpy_binary
numpy_txt
pandas_csv
pandas_generic
pickle
pkl
plain_text
text
xr_dataarray
xr_dataset
yaml
yaml_to_object
yml
yml_to_object
Hint
To learn about available arguments for these loaders, have a look at the API reference for the corresponding mixin, e.g. starting from the AllAvailableLoadersMixin.
Missing a loader?
No problem. You can easily specialize your data manager to include a custom loader.
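As a rough sketch of what such a specialization could look like (this assumes the add_loader decorator from dantro.data_loaders and the _load_<name> naming convention; consult the specialization documentation for the authoritative signature):

import dantro
from dantro.containers import ObjectContainer
from dantro.data_loaders import add_loader

class MyFormatLoaderMixin:
    """Hypothetical mixin supplying a 'my_format' loader"""

    @add_loader(TargetCls=ObjectContainer)
    def _load_my_format(filepath: str, *, TargetCls: type, **load_kwargs) -> ObjectContainer:
        # Read the file in whatever way the custom format requires
        with open(filepath) as f:
            data = f.read()
        return TargetCls(data=data, attrs=dict(filepath=filepath))

class MyDataManager(MyFormatLoaderMixin, dantro.DataManager):
    """Can now be used with loader='my_format' in load() calls"""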
Loading Data#
To load data into the data tree, there are two main methods:
- The load() method loads a single so-called data entry.
- The load_from_cfg() method loads multiple such entries; the cfg refers to a set of configuration entries.
For example, having specialized a data manager, data can be loaded in the following way:
import dantro
from dantro.data_loaders import YamlLoaderMixin
class MyDataManager(YamlLoaderMixin, dantro.DataManager):
    """A DataManager specialization that can load YAML data"""

dm = MyDataManager(data_dir=my_data_dir)

# Now, data can be loaded using the `load` command:
dm.load("some_data",       # where to load the data to
        loader="yaml",     # which loader to use
        glob_str="*.yml")  # which files to find and load
# Access it
dm["some_data"]
# ...
The Load Configuration#
A core concept of dantro is to make a lot of functionality available via configuration hierarchies, which are well-representable using YAML configuration files.
This is also true for the DataManager, which can be initialized with a certain default load configuration, specifying multiple data entries to load.
When integrating dantro into your project, you will likely be in a situation where the structure of the data you are working with is known and more or less fixed. In such scenarios, it makes sense to pre-define which data you would like to load, how it should be loaded, and where it should be placed in the data tree.
This load configuration can be passed to the DataManager during initialization using the load_cfg argument, either as a path to a YAML file or as a dictionary.
When then invoking load_from_cfg(), these default entries are loaded.
Alternatively, load_from_cfg() also accepts a new load config or allows updating the default load config.
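For example, a default load configuration could be set up like this (a minimal sketch; the entry name and glob pattern are placeholders):

my_load_cfg = dict(
    my_config_files=dict(
        loader="yaml",
        glob_str="config/*.yml",
    ),
)

dm = MyDataManager(data_dir=my_data_dir, load_cfg=my_load_cfg)

# Loads the default entries defined above
dm.load_from_cfg(print_tree=True)

# A different or updated load config can also be passed directly to
# load_from_cfg; see its docstring for the exact arguments.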
Example Load Configurations#
In the following, some advanced examples of specific load configurations are shown. These illustrate the various ways in which data can be loaded into the data tree. While most examples use only a single data entry, the entries can be readily combined into a common load configuration.
The basic setup for all the examples is as follows:
import dantro
from dantro.data_loaders import AllAvailableLoadersMixin
class MyDataManager(AllAvailableLoadersMixin, dantro.DataManager):
    """A DataManager specialization that can load various kinds of data"""

dm = MyDataManager(data_dir=my_data_dir, load_cfg=my_load_cfg)
The examples below are all structured in the following way:
- First, they show the configuration that is passed as the my_load_cfg parameter, represented as YAML.
- Then, they show the Python invocation of the load_from_cfg() method, including the resulting data tree.
- Finally, they make a few remarks on what happened.
For specific information on argument syntax, refer to the docstring of the load() method.
Defining a target path within the data tree#
The target_path option allows more control over where data is loaded to.
my_config_files:
  loader: yaml
  glob_str: 'config/*.yml'
  required: true

  # Use information from the file name to generate the target path
  path_regex: config/(\w+)_cfg.yml
  target_path: cfg/{match:}
dm.load_from_cfg(print_tree=True)
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
# └─ cfg <OrderedDataGroup, 5 members, 0 attributes>
# └┬ combined <MutableMappingContainer, 1 attribute>
# ├ defaults <MutableMappingContainer, 1 attribute>
# ├ machine <MutableMappingContainer, 1 attribute>
# ├ update <MutableMappingContainer, 1 attribute>
# └ user <MutableMappingContainer, 1 attribute>
Remarks:
- With the required argument, an error is raised when no files were matched by glob_str.
- With the path_regex argument, information from the path of the files can be used to generate a target_path within the tree, using the {match:} format string. In this example, this is used to drop the _cfg suffix, which would otherwise appear in the data tree.
- With a target_path given, the name of the data entry (here: my_config_files) is decoupled from the position where the data is loaded to. Without that argument and the regex, the config files would have been loaded as my_config_files/combined_cfg, for example.
Hint
The regular expression in path_regex is not limited to a single match; it can also contain multiple matching groups, which can be unnamed or named.
See _prepare_target_path() for more options and examples.
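To illustrate what {match:} resolves to in the example above, here is a plain re sketch of the capture group (standard-library regex behavior, not dantro internals):

import re

m = re.search(r"config/(\w+)_cfg.yml", "config/machine_cfg.yml")
print(m.group(1))  # -> "machine", so target_path cfg/{match:} becomes cfg/machine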
Combining data entries#
The target_path option also allows combining data from different data entries, e.g. when they belong to the same measurement time:
# Load the (binary) measurement data for each day
measurement_data:
  loader: hdf5
  glob_str: measurements/day*.hdf5
  required: true
  path_regex: measurements/day(\d+).hdf5
  target_path: measurements/{match:}/data

# Load the parameter files, containing information about each day
measurement_parameters:
  loader: yaml
  glob_str: measurements/day*_params.yml
  required: true
  path_regex: measurements/day(\d+)_params.yml
  target_path: measurements/{match:}/params
dm.load_from_cfg(print_tree="condensed")
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
# └─ measurements <OrderedDataGroup, 42 members, 0 attributes>
# └┬ 000 <OrderedDataGroup, 2 members, 0 attributes>
# └┬ params <MutableMappingContainer, 1 attribute>
# └ data <OrderedDataGroup, 3 members, 0 attributes>
# └┬ precipitation <NumpyDataContainer, int64, shape (126,), 0 at…
# ├ sensor_data <OrderedDataGroup, 23 members, 1 attribute>
# └┬ sensor000 <NumpyDataContainer, float64, shape (3, 89), 0 attributes>
# ├ sensor001 <NumpyDataContainer, float64, shape (3, 85), 0 attributes>
# ├ sensor002 <NumpyDataContainer, float64, shape (3, 94), 0 attributes>
# ├ ... ... (18 more) ...
# ├ sensor021 <NumpyDataContainer, float64, shape (3, 80), 0 attributes>
# └ sensor022 <NumpyDataContainer, float64, shape (3, 99), 0 attributes>
# └ temperatures <NumpyDataContainer, float64, shape (126,), 0 attributes>
# ├ 001 <OrderedDataGroup, 2 members, 0 attributes>
# └┬ params <MutableMappingContainer, 1 attribute>
# └ data <OrderedDataGroup, 3 members, 0 attributes>
# └┬ precipitation <NumpyDataContainer, int64, shape (150,), 0 attributes>
# ├ sensor_data <OrderedDataGroup, 23 members, 1 attribute>
# └┬ sensor000 <NumpyDataContainer, float64, shape (3, 99), 0 attributes>
# ├ sensor001 <NumpyDataContainer, float64, shape (3, 85), 0 attributes>
# ├ ...
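Based on the tree shown above, the combined entries of a single day can then be accessed side by side (a short usage sketch):

# Parameters and binary data of day 000 now live next to each other
day0 = dm["measurements"]["000"]
print(day0["params"])
print(day0["data"]["temperatures"])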
Loading data as container attributes#
In some scenarios, it is desirable to load some data not as a regular entry into the data tree, but as a container attribute. Continuing with the example from above, we might want to load the parameters directly into the container for each day.
# Load the (binary) measurement data for each day
measurement_data:
  loader: hdf5
  glob_str: measurements/day*.hdf5
  required: true
  path_regex: measurements/day(\d+).hdf5
  target_path: measurements/{match:}

# Load the parameter files as container attributes
params:
  loader: yaml
  glob_str: measurements/day*_params.yml
  required: true
  load_as_attr: true
  unpack_data: true
  path_regex: measurements/day(\d+)_params.yml
  target_path: measurements/{match:}
dm.load_from_cfg(print_tree="condensed")
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
# └─ measurements <OrderedDataGroup, 42 members, 0 attributes>
# └┬ 000 <OrderedDataGroup, 3 members, 1 attribute>
# └┬ precipitation <NumpyDataContainer, int64, shape (165,), 0 attributes>
# ├ sensor_data <OrderedDataGroup, 23 members, 1 attribute>
# └┬ sensor000 <NumpyDataContainer, float64, shape (3, 92), 0 attributes>
# ├ sensor001 <NumpyDataContainer, float64, shape (3, 91), 0 attributes>
# ├ sensor002 <NumpyDataContainer, float64, shape (3, 93), 0 attributes>
# ├ ... ... (18 more) ...
# ├ sensor021 <NumpyDataContainer, float64, shape (3, 83), 0 attributes>
# └ sensor022 <NumpyDataContainer, float64, shape (3, 97), 0 attributes>
# └ temperatures <NumpyDataContainer, float64, shape (165,), 0 attributes>
# ├ 001 <OrderedDataGroup, 3 members, 1 attribute>
# └┬ precipitation <NumpyDataContainer, int64, shape (181,), 0 attributes>
# ├ sensor_data <OrderedDataGroup, 23 members, 1 attribute>
# └┬ sensor000 <NumpyDataContainer, float64, shape (3, 84), 0 attributes>
# ├ sensor001 <NumpyDataContainer, float64, shape (3, 85), 0 attributes>
# ├ ...
# Check attribute access to the parameters
for cont_name, data in dm["measurements"].items():
    params = data.attrs["params"]
    assert params["day"] == int(cont_name)
Note the 000 group showing one more attribute than in previous examples; this is the params attribute.
Remarks:
- By using load_as_attr, the measurement parameters are made available as a container attribute and become accessible via its attrs property. (This is not to be confused with regular Python object attributes.)
- When using load_as_attr, the entry name is used as the attribute name.
- The unpack_data option makes the stored object a dictionary, rather than a MutableMappingContainer, reducing one level of indirection.
Prescribing tree structure and specializations#
Sometimes, load configurations become easier to handle when an empty tree structure is created prior to loading.
This can be done using the DataManager's create_groups argument, which also allows specifying custom group classes, e.g. to denote a time series.
# Load the (binary) measurement data for each day
measurement_data:
  loader: hdf5
  glob_str: measurements/day*.hdf5
  required: true
  path_regex: measurements/day(\d+).hdf5
  target_path: measurements/{match:}
from dantro.groups import TimeSeriesGroup
dm = MyDataManager(data_dir=my_data_dir, out_dir=False,
                   load_cfg=my_load_cfg,
                   create_groups=[dict(path="measurements",
                                       Cls=TimeSeriesGroup)])
dm.load_from_cfg(print_tree="condensed")
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
# └─ measurements <TimeSeriesGroup, 42 members, 0 attributes>
# └┬ 000 <OrderedDataGroup, 3 members, 0 attributes>
# └┬ precipitation <NumpyDataContainer, int64, shape (165,), 0 attributes>
# ├ sensor_data <OrderedDataGroup, 23 members, 1 attribute>
# └┬ sensor000 <NumpyDataContainer, float64, shape (3, 92), 0 attributes>
# ├ sensor001 <NumpyDataContainer, float64, shape (3, 91), 0 attributes>
# ├ sensor002 <NumpyDataContainer, float64, shape (3, 93), 0 attributes>
# ├ ...
Remarks:
- Multiple paths can be specified in create_groups.
- Paths can also have multiple segments, like my/custom/group/path.
- The dm['measurements'] entry is now a TimeSeriesGroup, and thus represents one dimension of the stored data, e.g. the precipitation data.
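As a rough sketch (with hypothetical paths), several groups with multi-segment paths and custom classes could be prescribed like this:

dm = MyDataManager(
    data_dir=my_data_dir,
    create_groups=[
        dict(path="my/custom/group/path"),               # uses the default group class
        dict(path="measurements", Cls=TimeSeriesGroup),  # uses a custom group class
    ],
)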
Loading data as proxy#
Sometimes, data is too large to be loaded into memory completely. For example, if we are only interested in the precipitation data, the sensor data should not be loaded into memory.
Dantro provides a mechanism to build the data tree using placeholder objects, so-called proxies. The following example illustrates that, and furthermore uses the dask framework to allow delayed computations.
# Load the (binary) measurement data for each day
measurement_data:
  loader: hdf5
  glob_str: measurements/day*.hdf5
  required: true
  path_regex: measurements/day(\d+).hdf5
  target_path: measurements/{match:}
  load_as_proxy: true
  proxy_kwargs:
    resolve_as_dask: true
from dantro.containers import XrDataContainer
from dantro.mixins import Hdf5ProxySupportMixin
class MyXrDataContainer(Hdf5ProxySupportMixin, XrDataContainer):
    """An xarray data container that allows proxy data"""

class MyDataManager(AllAvailableLoadersMixin, dantro.DataManager):
    """A DataManager specialization that can load various kinds of data
    and uses containers that supply proxy support
    """

    # Configure the HDF5 loader to use the custom xarray container
    _HDF5_DSET_DEFAULT_CLS = MyXrDataContainer

dm = MyDataManager(data_dir=my_data_dir, out_dir=False,
                   load_cfg=my_load_cfg)
dm.load_from_cfg(print_tree="condensed")
# Will print something like:
# Tree of MyDataManager, 1 member, 0 attributes
# └─ measurements <OrderedDataGroup, 42 members, 0 attributes>
# └┬ 000 <OrderedDataGroup, 3 members, 0 attributes>
# └┬ precipitation <MyXrDataContainer, proxy (hdf5, dask), int64, shape (165,), 0 attributes>
# ├ sensor_data <OrderedDataGroup, 23 members, 1 attribute>
# └┬ sensor000 <MyXrDataContainer, proxy (hdf5, dask), float64, shape (3, 92), 0 attributes>
# ├ sensor001 <MyXrDataContainer, proxy (hdf5, dask), float64, shape (3, 91), 0 attributes>
# ├ sensor002 <MyXrDataContainer, proxy (hdf5, dask), float64, shape (3, 93), 0 attributes>
# ├ ...
# Work with the data in the same way as before; it's loaded on the fly
total_precipitation = 0.
for day_data in dm["measurements"].values():
    total_precipitation += day_data["precipitation"].sum()
Remarks:
- By default, the NumpyDataContainer and XrDataContainer classes do not provide proxy support. This is why a custom class needs to be specialized to allow loading the data as proxy.
- Furthermore, the DataManager's Hdf5LoaderMixin needs to be told to use the custom data container class.
For details about loading large data using proxies and dask, see Handling Large Amounts of Data.