# Data Transformation Framework

The uniform structure of the dantro data tree is the ideal starting point to allow more general application of transformation on data. This page describes dantro’s data transformation framework, revolving around the `TransformationDAG` class. It is sometimes also referred to as DAG framework or data selection and transformation framework and finds application in the plotting framework.

This page is an introduction to the DAG framework and a description of its inner workings. To learn more about its practical usage, make sure to look at the Data Transformation Examples.

## Overview

The purpose of the transformation framework is to be able to generally apply mathematical operations on data that is stored in a dantro data tree. Specifically, it makes it possible to define transformations without touching actual Python code. To that end, a meta language is defined that makes it possible to define almost arbitrary transformations.

In dantro terminology, a transformation is defined as a set consisting of an operation and some arguments. For example, to add two quantities, 1 and 2, we would usually write `1 + 2`. In the meta language, this translates to a set consisting of the `add` operation and two (ordered) arguments: `1` and `2`.

Now, typically transformations don’t come on their own and are not nearly as trivial as the above. You might desire to compute `a + b`, where both `a` and `b` are results of previous transformations.

This can be represented as a directed acyclic graph, or DAG for short. For the example above, the graph is rather small:

```
  a:(…)   b:(…)
     ^      ^
      \    /
       \  /
        add
```

The nodes in this graph represent transformations. These nodes can have labels, e.g. `a` and `b`, which are called references or tags in dantro terminology. As illustrated by the example above, the tags can be used in place of arguments to denote that the result of a previous transformation (with the corresponding label) should be used.

The directed edges in the graph represent dependencies. The acyclic in DAG is required such that the computation of a transformation result does not end in an infinite loop due to a circular dependency.
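To make this structure concrete, here is a minimal Python sketch of the idea. It is not dantro's implementation: the names `OPERATIONS`, `Ref`, and `compute` are made up for illustration, and references are resolved by tag rather than by hash.

```python
import operator

# Hypothetical operation table; dantro ships its own, much larger database.
OPERATIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


class Ref:
    """Marks an argument that refers to another node by its tag."""

    def __init__(self, tag):
        self.tag = tag


# A tiny DAG: tag -> (operation name, positional arguments)
dag = {
    "a": ("add", [1, 2]),
    "b": ("mul", [3, 4]),
    "a_plus_b": ("add", [Ref("a"), Ref("b")]),
}


def compute(tag, dag):
    """Compute a node's result, resolving references recursively."""
    op_name, args = dag[tag]
    resolved = [compute(a.tag, dag) if isinstance(a, Ref) else a for a in args]
    return OPERATIONS[op_name](*resolved)


result = compute("a_plus_b", dag)  # (1 + 2) + (3 * 4)
```

Because the graph is acyclic, the recursion always terminates; a circular reference would recurse without end.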

The `Transformation` and `TransformationDAG` dantro classes implement exactly this structure, making the following features available:

• Easy and generic access to data stored in an associated `DataManager`

• Definition of arbitrary DAGs via dictionary-based configurations

• Syntax optimized to make specification via YAML easy

• Shorthand notations available

• New and custom operations can be registered

• There are no restrictions on the signature of operations

• Caching of transformations is possible, avoiding re-calculation of computationally expensive transformations

• Transformations are uniquely representable by a hash

## The Transformation Syntax

This section will guide you through the syntax used to define transformations. It will explain the basic elements and inner workings of the mini-language created for the purpose of the DAG.

Note

This explanation goes into quite some detail, and it is important to understand the underlying structures. If you would like to jump ahead to see what awaits you, have a look at the Minimal Syntax.

### The `TransformationDAG`

The structure a user (you!) mainly interacts with is the `TransformationDAG` class. It takes care of building the DAG by creating `Transformation` objects according to the specification you provide. In the following, all YAML examples represent the arguments that are passed to the `TransformationDAG` during initialization.

### Basics

Ok, let’s start with the basics: How can transformations be defined? For the sake of simplicity, let’s only look at transformations that are fully independent of other transformations.

#### Explicit syntax

The explicit syntax to define a single `Transformation` via the `TransformationDAG` looks like this:

```
transform:
  - operation: add
    args: [1, 2]
    kwargs: {}
```

The `transform` argument is the main argument to specify transformations. It accepts a sequence of mappings. Each entry of the sequence contains all arguments that are needed to create a single `Transformation`.

As you see, the syntax is very close to the above definition of what a dantro transformation contains.

Note

The `args` and `kwargs` arguments can also be left out, if no positional or keyword arguments are to be passed, respectively. This is equivalent to setting them to `~` or empty lists / dicts.

#### Specifying multiple transformations

To specify multiple transformations, simply add more entries to the `transform` sequence:

```
transform:
  - operation: add
    args: [3, 4]
  - operation: sub
    args: [8, 2]
  - operation: mul
    args: [6, 7]
```

#### Assigning tags

Nodes of the DAG all have a unique identifier in the form of a hash string, which is a 32 character hexadecimal string. While it can be used to identify a transformation, the easiest way to refer to it is by using a so-called tag.

Tags are simply plain text pointers to a specific hash, which in turn denotes a specific transformation. To add a tag to a transformation, use the `tag` key.

```
transform:
  - operation: add
    args: [3, 4]
  - operation: sub
    args: [8, 2]
    tag: some_substraction
  - operation: mul
    args: [6, 7]
```

Note

No two transformations can have the same tag.
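The relation between hashes and tags can be illustrated with a small sketch. It assumes, purely for illustration, that a hash is the MD5 digest of a canonical string built from the operation and its arguments; MD5 happens to yield 32 hexadecimal characters, but how dantro actually computes its hashes is not covered here. `node_hash` and `tags` are hypothetical names.

```python
import hashlib
import json


def node_hash(operation, args=(), kwargs=None):
    """Return a 32-character hex digest for an operation and its arguments.

    Assumption: a sorted JSON dump serves as canonical representation. The
    tag is deliberately not part of the hashed content.
    """
    payload = json.dumps(
        {"operation": operation, "args": list(args), "kwargs": kwargs or {}},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Tags are plain-text pointers to hashes
tags = {"some_substraction": node_hash("sub", [8, 2])}

h_sub = tags["some_substraction"]
h_mul = node_hash("mul", [6, 7])
```

Two nodes with the same operation and arguments end up with the same hash in this sketch, regardless of whether or how they are tagged.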

In the examples above, all transformations were independent of each other. Having completely independent and disconnected nodes, of course, defeats the purpose of having a DAG structure.

Now let’s look at proper, non-trivial DAGs, where individual transformations use the results of other transformations.

#### Referencing other Transformations

Other transformations can be referenced in three ways, each with a corresponding Python class and an associated YAML tag:

• `DAGReference` and `!dag_ref`: This is the most basic and most explicit reference, using the transformations’ hash to identify a reference.

• `DAGTag` and `!dag_tag`: References by tag are the preferred references. They use the plain text name specified via the `tag` key.

• `DAGNode` and `!dag_node`: Uses the ID of the node within the DAG. Mostly for internal usage!

Note

When the DAG is built, all references are brought into the most explicit format: `DAGReference`s. Thus, internally, the transformation framework works only with hash references.

The best way to refer to other transformations is by tag: there is no ambiguity, it is easy to define, and it allows you to easily build a DAG tree structure. A simple example with three nodes would be the following:

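```
transform:
  - operation: add
    args: [3, 4]
    tag: some_addition
  - operation: sub
    args: [8, 2]
    tag: some_substraction
  - operation: mul
    args:
      - !dag_tag some_addition
      - !dag_tag some_substraction
```

Which is equivalent to:

```
some_addition = 3 + 4
some_substraction = 8 - 2
some_addition * some_substraction  # result of the untagged mul node
```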

References can appear within the positional and the keyword arguments of a transformation. As you see, they behave quite a bit like variables behave in programming languages; the only difference being: you can’t reassign a tag and you should not form circular dependencies.

#### Using the result of the previous transformation

When chaining multiple transformations to each other and not being interested in the intermediate results, it is tedious to always define tags:

```
transform:
  - operation: mul
    args: [1, 2]
    tag: f2
  - operation: mul
    args: [!dag_tag f2, 3]
    tag: f3
  - operation: mul
    args: [!dag_tag f3, 4]
    tag: f4
  - operation: mul
    args: [!dag_tag f4, 5]
    tag: f5
```

Let’s say, we’re only interested in `f5`. The only thing we want is that the result from the previous transformation is carried on to the next one. The `with_previous_result` feature can help in this case: It adds as the first positional argument a reference to the previous node. Thus, it is no longer necessary to define a tag.

```
transform:
  - operation: mul
    args: [1, 2]
  - operation: mul
    args: [3]
    with_previous_result: true
  - operation: mul
    args: [4]
    with_previous_result: true
  - operation: mul
    args: [5]
    with_previous_result: true
    tag: f5
```

Note that the `args`, in that case, specify one fewer positional argument.
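The effect of `with_previous_result` can be pictured as a simple rewriting step. The following sketch assumes that specifications are plain dicts and uses a hypothetical `PREV` marker in place of an actual node reference; it illustrates the idea only and is not dantro's parser.

```python
PREV = "<previous node>"  # stand-in for a reference to the previous node


def expand_with_previous_result(specs):
    """Return copies of the specs in which ``with_previous_result: true`` is
    replaced by an explicit reference as first positional argument."""
    expanded = []
    for spec in specs:
        spec = dict(spec)  # shallow copy; the input is not mutated
        if spec.pop("with_previous_result", False):
            spec["args"] = [PREV, *spec.get("args", [])]
        expanded.append(spec)
    return expanded


specs = [
    {"operation": "mul", "args": [1, 2]},
    {"operation": "mul", "args": [3], "with_previous_result": True},
    {"operation": "mul", "args": [4], "with_previous_result": True},
    {"operation": "mul", "args": [5], "with_previous_result": True, "tag": "f5"},
]
expanded = expand_with_previous_result(specs)
```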

Warning

Using `!dag_node` in your specifications is not recommended. Use it only if you really know what you’re doing.

In case the result of the previous transformation should not be used in place of the first positional argument but somewhere else, there is the `!dag_prev` YAML tag, which creates a node reference to the previous node:

```
transform:
  - operation: define
    args: [10]
  - operation: sub
    args: [0, !dag_prev ]
  - operation: div
    args: [1, !dag_prev ]
  - operation: power
    args: [10, !dag_prev ]
    tag: my_result
```

Note

Notice the space behind `!dag_prev`. The YAML parser might complain about a character directly following the tag, like `…, !dag_prev]`.

### Computing Results

To compute the results of the DAG, invoke the `TransformationDAG`’s `compute()` method.

It can be called without any arguments, in which case the result of all tagged transformations will be computed and returned as a dict. If only the result of a subset of tags should be computed, they can also be specified.

Computing results works as follows:

1. Each tagged `Transformation` is visited and its own `compute()` method is invoked

2. A cache lookup occurs, attempting to read the result from a memory or file cache.

3. The transformations resolve potential references in their arguments: If a `DAGReference` is encountered, the corresponding `Transformation` is resolved and that transformation’s `compute()` method is invoked. This traverses all the way up the DAG until reaching the root nodes which contain only basic data types (that need no computation).

4. Having resolved all references into results, the arguments are assembled, the operation callable is resolved, and invoked by passing the arguments.

5. The result is kept in a memory cache. It can additionally be stored in a file cache to persist to later invocations.

6. The result object is returned.
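The steps above can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: nodes reference each other directly instead of via hashes, only a memory cache exists, and the names `Node`, `OPERATIONS`, and `calls` are made up for this illustration.

```python
import operator

OPERATIONS = {"add": operator.add, "mul": operator.mul}
calls = []  # records which nodes actually invoked their operation


class Node:
    """A toy transformation; arguments are plain values or other nodes."""

    def __init__(self, name, operation, args):
        self.name, self.operation, self.args = name, operation, args
        self._cache = None
        self._has_cache = False

    def compute(self):
        # Step 2: cache lookup (memory cache only in this sketch)
        if self._has_cache:
            return self._cache
        # Step 3: resolve references by computing the referenced nodes
        args = [a.compute() if isinstance(a, Node) else a for a in self.args]
        # Step 4: resolve the operation callable and invoke it
        result = OPERATIONS[self.operation](*args)
        calls.append(self.name)
        # Step 5: keep the result in the memory cache
        self._cache, self._has_cache = result, True
        # Step 6: return the result
        return result


shared = Node("shared", "add", [1, 2])
left = Node("left", "mul", [shared, 10])
right = Node("right", "mul", [shared, 100])

# Step 1: visit each tagged node and invoke its compute() method
tagged = {"left": left, "right": right}
results = {tag: node.compute() for tag, node in tagged.items()}
```

The `shared` node is needed by both tagged nodes, but its operation runs only once; the second lookup is served from the cache.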

Note

Only nodes that are tagged can be part of the results. Intermediate results still need to be computed, but they will not be part of the results dict. If you want an intermediate result to be available there, add a tag to it.

This also means: If there are parts of the DAG that are not tagged at all, they will not be reached by any recursive argument lookup.

Hint

Use the `compute_only` argument of `compute()` to specify which tags are to be computed. If not given, all tags will be computed, unless they start with a `.` or `_` (these are so-called “private” tags).

To compute private tags directly, include them in `compute_only`.

You can also force computation of a node, even if untagged, by adding `force_compute` to the transformation, see below.
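The tag selection rule from this hint can be written down as a small function. `tags_to_compute` is a hypothetical helper for illustration; it only mirrors the rule as stated here and ignores force-computed nodes.

```python
def tags_to_compute(all_tags, compute_only=None):
    """Select the tags to compute.

    If ``compute_only`` is given, exactly those tags are used; otherwise all
    tags except the "private" ones (starting with ``.`` or ``_``).
    """
    if compute_only is not None:
        return list(compute_only)
    return [t for t in all_tags if not t.startswith((".", "_"))]


all_tags = ["my_result", "_helper", ".scratch", "other"]
default_selection = tags_to_compute(all_tags)
explicit_selection = tags_to_compute(all_tags, compute_only=["_helper"])
```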

Hint

To learn which parts of the computation require the most time, e.g. in order to evaluate whether to cache the result, inspecting the DAG profile statistics can be useful. The `TransformationDAG`’s `verbosity` attribute controls how extensively statistics are written to the log output. By default (verbosity `1`), only per-node statistics are emitted. For levels `>= 2`, per-operation statistics are shown alongside.

#### Resolving and applying operations

Let’s have a brief look into how the `operation` argument is actually resolved and how the operation is then applied.

This feature is not specific to the DAG, but the DAG uses the `data_ops` module, which implements a database of available operations and the `apply_operation()` function to apply an operation. Basically, this is a thin wrapper around a function lookup and its invocation.

For a full list of available data operations, see here.

Hint

You can also use the `import` operation to retrieve a callable (or any other object) via a Python import and then use the `call` operation to invoke it. These two operations are combined in the `import_and_call` operation:

```
transform:
  - operation: import_and_call
    args: [numpy.random, randint]
    kwargs:
      low: 0
      high: 10
      size: [2, 3, 4]
```

To specifically register additional operations, use the `register_operation()` function. This should only be done for operations that are not easily usable via the `import` and `call` operations.
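The "database plus thin wrapper" idea can be sketched as follows. The functions below are toy re-implementations with deliberately different names (`register_toy_operation`, `apply_toy_operation`, `toy_import_and_call`); their signatures are assumptions and should not be read as the signatures of dantro's `register_operation()` or `apply_operation()`.

```python
import importlib
import operator

_TOY_OPERATIONS = {"add": operator.add}  # stand-in for the operations database


def register_toy_operation(name, func):
    """Register a callable under a name, refusing to overwrite silently."""
    if name in _TOY_OPERATIONS:
        raise ValueError(f"Operation '{name}' already exists!")
    _TOY_OPERATIONS[name] = func


def apply_toy_operation(op_name, *args, **kwargs):
    """Look up the operation by name and invoke it with the given arguments."""
    try:
        func = _TOY_OPERATIONS[op_name]
    except KeyError:
        raise ValueError(f"No operation '{op_name}' registered!") from None
    return func(*args, **kwargs)


def toy_import_and_call(module, attr, *args, **kwargs):
    """Import ``attr`` from ``module`` and call it with the given arguments."""
    return getattr(importlib.import_module(module), attr)(*args, **kwargs)


register_toy_operation("import_and_call", toy_import_and_call)
register_toy_operation("double", lambda x: 2 * x)

three = apply_toy_operation("add", 1, 2)
four = apply_toy_operation("double", 2)
root = apply_toy_operation("import_and_call", "math", "sqrt", 16)
```

Since anything importable can be reached through the import-and-call route, explicit registration is only needed for callables that are not conveniently importable.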

#### Forcing computation of individual nodes

Sometimes, it’s useful to force the computation of an individual node. To that end, simply set the `force_compute` option for a transformation:

```
transform:
  - operation: div
    args: [1, 0]
    force_compute: true
```

In this example, the node will always be computed, obviously leading to a `ZeroDivisionError`.

A few remarks:

• Force-computed tags are computed before the tags specified in `compute_only`.

• Typical use case is during debugging, where you want to make sure that an operation really is carried out.

• Unlike tagged nodes, their results are not available in the results dict. Even if the node is tagged, it will only appear in the results dict if it is part of `compute_only`.

### Selecting from the `DataManager`

The above examples are trivial in that they do not use any actual data but define some dummy values. This section shows how data can be selected from the `DataManager` that is associated with the `TransformationDAG`.

The process of selecting data is no different from other transformations. It makes use of the `getitem` operation that would also be used for regular item access, and it uses the fact that the data manager is available via the `dm` tag.

Note

The `DataManager` is also identified by a hash, which is computed from its name and its associated data directory path. Thus, managers for different data directories have different hashes.

#### The `select` interface

As selecting data from the `DataManager` is a common use case, the `TransformationDAG` supports the `select` argument besides the `transform` argument.

The `select` argument expects a mapping of tags to either strings (the path within the data tree) or further mappings (where more configurations are possible):

```
select:
  some_data: path/to/some_data
  more_data:
    path: path/to/more_data
    # ... potentially more kwargs
transform: ~
```

The results dict will then have two tags, `some_data` and `more_data`, each of which is the selected object from the data tree.

Note

The above example is translated into the following basic transformation specifications:

```
transform:
  - operation: getitem
    args: [!dag_tag dm, path/to/more_data]
    tag: more_data
  - operation: getitem
    args: [!dag_tag dm, path/to/some_data]
    tag: some_data
```

Note that the order of operations is sorted alphabetically by the tag specified under the `select` key.
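This translation can be sketched in Python. The `DAGTag` class below is a minimal stand-in and not dantro's class of the same name, and `expand_select` is a hypothetical helper; additional keys of the mapping form are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DAGTag:
    """Minimal stand-in for a reference by tag."""

    name: str


def expand_select(select, base="dm"):
    """Translate a ``select`` mapping into basic ``getitem`` specifications,
    sorted alphabetically by tag."""
    specs = []
    for tag in sorted(select):
        entry = select[tag]
        # Entries are either a path string or a mapping with a ``path`` key
        path = entry if isinstance(entry, str) else entry["path"]
        specs.append(
            {"operation": "getitem", "args": [DAGTag(base), path], "tag": tag}
        )
    return specs


select = {
    "some_data": "path/to/some_data",
    "more_data": {"path": "path/to/more_data"},
}
specs = expand_select(select)
```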

#### Directly transforming selected data

Often, it is desired to apply some sequential transformations to selected data before working with it. As part of the `select` interface, this is also possible:

```
select:
  square_increment:
    path: path/to/some_data
    with_previous_result: true
    transform:
      - operation: squared
      - operation: increment

  some_sum:
    path: path/to/more_data
    transform:
      - operation: getattr
        args: [!dag_prev , data]
      - operation: sub
        args: [0, !dag_prev ]
      - operation: .sum
        args: [!dag_prev ]
transform:
  - operation: add
    args: [!dag_tag square_increment, !dag_tag some_sum ]
    tag: my_result
```

Notice the difference between `square_increment`, where the result is carried over, and `some_sum`, where the reference has to be specified explicitly. As visible there, within the `select` interface, the `with_previous_result` option can also be specified such that it applies to a sequence of transformations that are based on some selection from the data manager.

Note

The parser expands this syntax into a sequence of basic transformations.

It does so before any other transformations from the `transform` argument are evaluated. Thus, whichever tags are defined there are not available from within `select`!

#### Changing the selection base

By default, selection happens from the associated `DataManager`, tagged `dm`. This option can be controlled via the `select_base` property, which can be set both as argument to `__init__` and afterward via the property. The property expects either a `DAGReference` object or a valid tag string.

If set, all following `select` arguments are using that reference as the basis, leading to `getitem` operations on that object rather than on the data manager.

As the `select` arguments are evaluated before any transform operations, only the default tags are available during initialization. To widen the possibilities, the `TransformationDAG` allows the `base_transform` argument during initialization; this is just a sequence of transform specifications, which are applied before the `select` argument is evaluated, thus allowing to select some object, tag it, and use that tag for the `select_base` argument.

Note

The `select_path_prefix` argument offers similar functionality, but merely prepends a path to the argument. If possible, the `select_base` functionality should be preferred over `select_path_prefix` as it reduces lookups and cooperates more nicely with the file caching features.

Background Information

Internally, when the `select` specification is evaluated, it is set to select against a special tag `select_base`; by default, this is the same as the `dm` special tag.

Effectively, the `select` feature always selects starting from the object the `select_base` property points to at the time the nodes are added to the DAG. In other words, if the `select_base` is changed after the nodes were added, this will not have any effect.

For meta-operations this means that the base of selection is not relevant at definition of the meta-operations; the base gets evaluated when the meta-operation is used.

### The `define` interface

So far, we have seen two ways to add transformation nodes to the DAG: via `transform` or via `select`. These are based either on directly adding the nodes, giving full control, or adding transformations based on a selection of data.

The `define` interface is a combination of these two approaches: like `select`, it revolves around the final tag that’s meant to be attached to the definition, but it does not require a data selection like `select` does.

Let’s look at an example that combines all these ways of adding transformations:

```
define:
  exponent: 4               # directly define some object
  days_to_seconds_factor:   # use a sequence of transformations
    - expression: "60 * 60 * 24"
    - float
select:
  some_data: path/to/some_data
  more_data: path/to/more_data
transform:
  - add: [!dag_tag some_data, !dag_tag more_data]
  - mul: [!dag_prev , !dag_tag days_to_seconds_factor]
  - print
  - pow: [!dag_prev , !dag_tag exponent]
    tag: my_result
```

Here, the `exponent` as well as some conversion factor tags are defined not ad-hoc but separately via the `define` interface. As can be seen in the example, there are two ways to do this:

• If providing a `list` or `tuple` type, it is interpreted as a sequence of transformations, accepting the same syntax as `transform`. After the final transformation, another node is added that sets the specified tag, `days_to_seconds_factor` in this example.

• If providing any other type, it is interpreted directly as a definition, adding a single transformation node that holds the given argument, the integer `4` in the case of the `exponent` tag.

Note

The `define` argument is evaluated before the other two. Subsequently, tags defined via `define` can be used within `select` or `transform`, but not the other way around.

Hint

In the context of plotting, the `define` interface has an important benefit over the `select` and `transform` syntax for adding nodes to the DAG: It is dictionary-based, which makes it very easy to recursively update its content; this is very useful for Plot Configuration Inheritance.

⚠️ Note: When using the DAG for plot data selection, not all arguments can be exposed on the top-level of the plot configuration; the `define` argument is one of the arguments that is nested in a DAG-specific config entry `dag_options`:

```
my_plot:
  dag_options:
    define:
      exponent: 4
      # ...

  select:
    # ...
  transform:
    - # ...
```

Nodes can be added to `TransformationDAG` during initialization; all the examples above are written in that way. However, transformation nodes can also be added after initialization, for example via the `add_node()` method.

### Minimal Syntax

To make the definition a bit less verbose, there is a so-called minimal syntax, which is internally translated into the explicit and verbose one documented above. This can make DAG specification much easier:

```
select:
  some_data: path/to/some_data
  more_data: path/to/more_data
transform:
  - add: [!dag_tag some_data, !dag_tag more_data]
  - increment
  - print
  - pow: [!dag_prev , 4]
    tag: my_result
```

This DAG will have three custom tags defined: `some_data`, `more_data` and `my_result`. Computation of the `my_result` tag is equivalent to:

```
my_result = ((some_data + more_data) + 1) ** 4
```

As can be seen above, the minimal syntax gets rid of the `operation`, `args` and `kwargs` keys by allowing a transformation to be specified as `<operation name>: <args or kwargs>` or even as just a string `<operation name>`, without further arguments.

#### With arguments, `<operation name>: <args or kwargs>`

When passing a sequence (e.g. `[foo, bar]`) the arguments are interpreted as positional arguments; when passing a mapping (e.g. `{foo: bar}`), they are treated as keyword arguments.

Hint

In this shorthand notation it is still possible to specify the respective “other” types of arguments using the `args` or `kwargs` keys. For example:

```
transform:
  - my_operation: [foo, bar]
    kwargs: { some: more, keyword: arguments }
  - my_other_operation: {foo: bar}
    args: [some, positional, arguments]
```

#### Without arguments, `<operation name>`

When specifying only the name of the operation as a string (e.g. `increment` and `print`), it is assumed that the operation accepts only a single positional argument and no other arguments. That argument is automatically filled with a reference to the result of the previous transformation, i.e.: the result is carried over.

For example, the above transformation with the `increment` operation would be translated to:

```
operation: increment
args: [!dag_prev ]
kwargs: {}
tag: ~
```
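The translation from minimal to explicit syntax can be sketched as a small function. This is an illustration under assumptions, not dantro's parser: specifications are plain Python objects as they would come out of a YAML loader, `PREV` is a hypothetical stand-in for what `!dag_prev` creates, and the handling of a single scalar argument is a guess.

```python
PREV = "<previous node>"  # stand-in for what ``!dag_prev`` would create


def expand_minimal(spec):
    """Translate a minimal-syntax entry into the explicit form."""
    if isinstance(spec, str):
        # Only the operation name: carry over the previous result
        return {"operation": spec, "args": [PREV], "kwargs": {}, "tag": None}

    if "operation" in spec:
        # Already explicit; only fill in defaults
        return {"args": [], "kwargs": {}, "tag": None, **spec}

    spec = dict(spec)
    extra = {k: spec.pop(k) for k in ("args", "kwargs", "tag") if k in spec}
    if len(spec) != 1:
        raise ValueError(f"Expected exactly one operation key, got: {list(spec)}")
    ((operation, params),) = spec.items()

    if isinstance(params, dict):
        # Mapping: keyword arguments; positional ones may come via ``args``
        args, kwargs = list(extra.get("args", [])), dict(params)
    elif isinstance(params, (list, tuple)):
        # Sequence: positional arguments; keyword ones may come via ``kwargs``
        args, kwargs = list(params), dict(extra.get("kwargs", {}))
    else:
        # Assumption: a single scalar is treated as one positional argument
        args, kwargs = [params], dict(extra.get("kwargs", {}))

    return {
        "operation": operation,
        "args": args,
        "kwargs": kwargs,
        "tag": extra.get("tag"),
    }


increment = expand_minimal("increment")
power = expand_minimal({"pow": [PREV, 4], "tag": "my_result"})
mixed = expand_minimal({"my_other_operation": {"foo": "bar"}, "args": ["some"]})
```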

### Operation Hooks

The DAG syntax parser allows attaching additional parsing functions to operations, which can help to supply a more concise syntax. These so-called operation hooks are described in more detail here. As an example, the `expression` operation can be specified much more conveniently with the use of its hook. Taking the example from above, the same can be expressed as:

```
select:
  some_data: path/to/some_data
  more_data: path/to/more_data
transform:
  - expression: (some_data + more_data + 1) ** 4
    tag: my_result
```

In this case, the hook automatically extracts the free symbols (`some_data` and `more_data`) and translates them to the corresponding `DAGTag` objects. Effectively, it parses the above to:

```
select:
  some_data: path/to/some_data
  more_data: path/to/more_data
transform:
  - expression: (some_data + more_data + 1) ** 4
    kwargs:
      symbols:
        some_data: !dag_tag some_data
        more_data: !dag_tag more_data
    tag: my_result
```
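What such a hook does can be approximated with Python's standard `ast` module. This is a swapped-in technique for illustration: how the actual hook parses expressions is not shown on this page, and the helper names as well as the `("dag_tag", name)` tuples standing in for `DAGTag` objects are assumptions.

```python
import ast


def free_symbols(expression):
    """Return the sorted names of all variables appearing in the expression."""
    tree = ast.parse(expression, mode="eval")
    return sorted({n.id for n in ast.walk(tree) if isinstance(n, ast.Name)})


def add_symbol_references(expression):
    """Build an explicit spec in which each free symbol maps to a tag
    reference, represented here as a ``("dag_tag", name)`` tuple."""
    symbols = {name: ("dag_tag", name) for name in free_symbols(expression)}
    return {
        "operation": "expression",
        "args": [expression],
        "kwargs": {"symbols": symbols},
    }


expr = "(some_data + more_data + 1) ** 4"
symbols = free_symbols(expr)
spec = add_symbol_references(expr)
```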

If you care to deactivate a hook, set the `ignore_hooks` flag for the operation:

```
operation: some_hooked_operation
args: [foo, bar]
ignore_hooks: true
```

Warning

Failing operation hooks will emit a logger warning, informing about the error; they do not raise an exception. While this might not lead to a failure during parsing, it might lead to an error during computation, e.g. when you are relying on the hook to have adjusted the operation arguments.

Depending on the operation arguments, there can be cases where the hook will not be able to perform its function because it lacks information that is only available after a computation. In such cases, it’s best to deactivate the hook as described above.

### Graph representation and visualization

The `TransformationDAG` has the ability to represent the internally used directed acyclic graph as a `networkx.classes.digraph.DiGraph`. By calling the `generate_nx_graph()` method, the `Transformation` objects are added to a graph and the dependencies between these transformations are added as directed edges.

This can help to better understand the generated DAG and is useful not only for debugging but also for optimization, as the graph can show the associated profiling information.

Hint

It can be configured whether the edges should represent the “flow” of results through the DAG (edges pointing towards the node that requires a certain result) or whether they should point towards a node’s dependency.

By default, `generate_nx_graph()` has `edges_as_flow` set to `True`, thus having edges point in the effective direction of computation.

#### Visualization

In addition to generating the graph object, the `visualize()` method can generate a visual output of the DAG:

In this example, the `- my_result -` node is tagged at the bottom and the arrows come from the transformations that these operations depend on. Effectively, calculation starts at the top, with data being read from the `dm` node, the associated `DataManager`, then following the arrows towards the `my_result` node and applying the specified operations like `squared`, `increment` and so on.

The circles in the background show the status of the computation, green meaning that a node’s result was computed as expected; other colors and their corresponding status are detailed in the legend. The node status can indicate where in a DAG computation routine an error occurred. To control this, have a look at the `show_node_status` argument and the `annotation_kwargs`, where the legend can be controlled.

Note

Operation arguments cannot easily be shown as it would quickly become too cluttered. For that reason, the visualization typically restricts itself to showing the operation name, the result (if computed), and the tag (if set).

See `visualize()` for more info.

Hint

If using the data transformation framework for plot data selection, visualization is deeply integrated there; see DAG Visualization.

Hint

DAG visualization works much better with pygraphviz installed, because it gives access to more capable layouting algorithms.

#### Export

To post-process the DAG data elsewhere, use the standalone `export_graph()` function.

### Full syntax specification of a single transformation node

To illustrate the possible arguments for creating a transformation node via `add_node()`, the following block contains a full specification of available keys and arguments. It is a combination of arguments to `Transformation` and arguments that are handled by `TransformationDAG`, which is aware of the whole DAG.

Note that this is the explicit representation, which is a bit verbose. Except for `operation`, `args`, `kwargs` and `tag`, all entries are set to default values.

```
operation: some_operation       # The name of the operation
args:                           # Positional arguments
  - !dag_tag some_result        # Reference to another result
  - second_arg
kwargs:                         # Keyword arguments
  one_kwarg: 123
  another_kwarg: foobar
salt: ~                         # Is included in the hash; set a value here
                                # if you would like to provoke a cache miss
fallback: ~                     # May only be given if ``allow_failure`` is
                                # also set, in which case it specifies a
                                # fallback value (or reference) to use
                                # instead of the operation result.

# All arguments _below_ are NOT taken into account when computing the hash
# of this transformation. Two transformations that differ _only_ in the
# arguments given below are considered equal to each other.

tag: my_result                  # The tag of this transformation. Optional.
force_compute: ~                # Used to force computation of this node
                                # without needing to assign a tag.

allow_failure: ~                # Whether to allow this transformation to
                                # fail during computation or resolution of
                                # the arguments (i.e.: upstream error).
                                # Special options are: log, warn, silent

memory_cache: True              # If False, will not keep the computed
                                # result in memory but either re-compute it
                                # or load it from the file cache.

file_cache:                     # File cache options
  read:                         # Read-related options
    enabled: false              # Whether to read from the file cache
    always: false               # If true, will always read from the file
                                # cache, regardless of whether the result
                                # was already stored in the memory cache
                                # or just computed.
  write:                        # Write-related options
    enabled: false              # Whether to write to the file cache

    # If writing is enabled, the following options determine whether a
    # cache file should actually be written (does not always make sense)
    always: false               # If true, forces writing
    allow_overwrite: false      # If false, will not write if a cache file
                                # already exists
    min_size: ~                 # If given, the result needs to have at
                                # least this size (in bytes) for it to be
                                # written to a cache file.
    max_size: ~                 # Like min_size, but upper boundary
    min_compute_time: ~         # If given, a cache file is only written
                                # if the computation time of this node on
                                # its own, i.e. without the computation
                                # time of the dependencies exceeded this
                                # value.
    min_cumulative_compute_time: ~  # Like min_compute_time, but actually
                                # taking into account the time it took to
                                # compute results of the dependencies.

    # Options used when storing a result in the cache
    storage_options:
      raise_on_error: false     # Whether to raise if saving failed
      attempt_pickling: true    # Whether to attempt pickling if saving
                                # via a specific save function failed
      pkl_kwargs: {}            # Passed on to pkl.dumps
      ignore_groups: true       # Whether to attempt storing dantro groups
      # ... additional arguments passed on to the specific saving function
```

Note

This does not reflect any arguments made available by the DAG parser! Features like the minimal syntax or the operation hooks are handled prior to the initialization of a `Transformation` object.
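To illustrate how the write-related cache options might interact, here is one plausible reading of the comments in the block above, expressed as a hypothetical `should_write_cache` function. The precedence between the options (in particular between `always` and `allow_overwrite`) is an assumption of this sketch and not taken from dantro's implementation.

```python
def should_write_cache(result_size, compute_time, cache_exists, *,
                       always=False, allow_overwrite=False,
                       min_size=None, max_size=None, min_compute_time=None):
    """Decide whether a cache file should be written (illustrative only).

    Sizes are in bytes; times are in the same unit as ``min_compute_time``.
    """
    # Assumption: an existing cache file is never replaced unless allowed
    if cache_exists and not allow_overwrite:
        return False
    # Assumption: ``always`` bypasses the size and time conditions
    if always:
        return True
    if min_size is not None and result_size < min_size:
        return False
    if max_size is not None and result_size > max_size:
        return False
    if min_compute_time is not None and compute_time <= min_compute_time:
        return False
    return True


write_small = should_write_cache(10, 1.0, False, min_size=50)
write_forced = should_write_cache(10, 1.0, False, min_size=50, always=True)
write_fast = should_write_cache(100, 0.1, False, min_compute_time=1.0)
```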

Hint

Often the easiest way to learn is by example. Make sure to check out the Data Transformation Examples page, where you will find practical examples that go beyond what is shown here.

## Meta-Operations

In essence, the transformation framework, as described above, can be used to define a sequence of data operations, just like a sequential program code could do. Now, what if parts of these operations are used multiple times? In a typical computer program, one would define a function to modularize part of the program. The equivalent construct in the data transformation framework is a so-called meta-operation, which can be characterized in the following way:

• It can have input arguments that define which objects it should work on

• It consists of a number of operations that transform the arguments in the desired way

• It has one (and only one) output, the return value

How are meta-operations defined? Meta-operations can be defined in just the same way as regular transformations are defined, with some additional syntax for defining positional arguments (`args`) and keyword arguments (`kwargs`). Let’s look at an example:

```
# Define meta-operations
meta_operations:
  # Compute (x**2 + 1) for a positional input argument x
  square_plus_one:
    - pow: [!arg 0, 2]
    - add: [!dag_prev , 1]   # <-- this operation's result is the "return value"

  # Select some data and directly compute its mean
  select_and_compute_mean:
    select:
      data:
        path: !kwarg to_select
    transform:
      - .mean: !dag_tag data
```

This defines two meta-operations: `square_plus_one` (with one positional argument) and `select_and_compute_mean` (with the `to_select` keyword argument). During initialization of `TransformationDAG`, these can be passed using the `meta_operations` argument.

How are meta-operations used? In exactly the same way as all regular data operations: simply define their name as the `operation` argument of a transformation.

```# Use the meta-operations within the regular data transformation, alongside
# regular data operations
transform:
  - select_and_compute_mean:
      to_select: path/to/some_data
    tag: some_data_mean
  - select_and_compute_mean:
      to_select: path/to/more_data
    tag: more_data_mean
  - add: [!dag_tag some_data_mean, !dag_tag more_data_mean]
  - square_plus_one: !dag_prev
    tag: result
```

Here, one of the meta-operations is used to compute two mean values from a selection; these are then added together via the regular `add` operation; finally, the other meta-operation is applied to that sum, yielding the result. While the individual meta-operations are not complex in themselves, this illustrates how repeatedly invoked transformations can be modularized.
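In plain Python terms, the two meta-operations play the role of functions and the `transform` list corresponds to a few function calls. The following is only an analogy and not how dantro evaluates the DAG; the `tree` dictionary and its values are made up to stand in for the data tree.

```python
from statistics import mean

# Hypothetical stand-in for the data tree: maps paths to some numbers
tree = {
    "path/to/some_data": [1.0, 2.0, 3.0],
    "path/to/more_data": [3.0, 4.0, 5.0],
}


def square_plus_one(x, /):
    """Counterpart of the list-based meta-operation: x**2 + 1"""
    return x**2 + 1


def select_and_compute_mean(*, to_select):
    """Counterpart of the dict-based meta-operation: select, then average"""
    data = tree[to_select]
    return mean(data)


# Counterpart of the `transform` list
some_data_mean = select_and_compute_mean(to_select="path/to/some_data")
more_data_mean = select_and_compute_mean(to_select="path/to/more_data")
result = square_plus_one(some_data_mean + more_data_mean)
print(result)
```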

Note

The examples in these sections use the `meta_operations` top-level entry to illustrate the definition of meta-operations. The `transform` and/or `select` top-level entries are used to denote how meta-operations can be invoked (in the same way as regular operations).

As a brief summary:

• Meta-operations are defined via the `meta_operations` argument of `TransformationDAG`, using the same syntax as for other transformations.

• They can specify positional and keyword arguments and have a return value.

• They can be used for the `operation` argument of any transformation, same as other available data operations.

• Meta-operations allow modularization and thereby simplify the definition of data transformations.

Hint

To use meta-operations for plot data selection, define them under the `dag_options.meta_operations` key of a plot configuration.

### Defining meta-operations#

The example above already gave a glimpse into how to define meta-operations. In many ways, this works exactly the same as defining transformations, e.g. under the `transform` argument.

#### Specifying arguments#

Like Python functions, meta-operations can have two kinds of arguments:

• Positional arguments, defined using the `!arg <position>` YAML tag

• Keyword arguments, defined using the `!kwarg <name>` YAML tag

These can be used anywhere inside the meta-operation specification and serve as placeholders for expected arguments. Let’s look at an example:

```meta_operations:
  my_equation:  # [(a + b) * c - d] / e
    - add: [!kwarg a, !kwarg b]
    - mul: [!dag_prev , !kwarg c]
    - sub: [!dag_prev , !kwarg d]
    - div: [!dag_prev , !kwarg e]

transform:
  - my_equation:
      a: 1
      b: 10
      c: 8
      d: 4
      e: 2
```

When the meta-operation gets translated into nodes, the corresponding positional and keyword arguments are replaced with the values from the `args` and `kwargs` of the transformation specification.

Some remarks:

• Positional and keyword arguments can be mixed

• Arguments can be referred to multiple times within a meta-operation definition

• The set of positional arguments, if specified, needs to include all integers between zero and the highest defined `!arg <position>`.

• Optional arguments are available via default values, see Argument default values below. Variable positional or keyword arguments are not supported (yet).
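To make the placeholder substitution more tangible, here is a minimal sketch that evaluates the `my_equation` specification directly. It is not dantro's implementation (which translates meta-operations into DAG nodes rather than evaluating them in a loop); the `Arg`, `KWarg`, `PREV`, and `call_meta_operation` names are hypothetical stand-ins for the `!arg`, `!kwarg`, and `!dag_prev` YAML tags.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Arg:
    """Hypothetical stand-in for the !arg YAML tag"""
    position: int


@dataclass(frozen=True)
class KWarg:
    """Hypothetical stand-in for the !kwarg YAML tag"""
    name: str


PREV = object()  # hypothetical stand-in for !dag_prev

OPERATIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}


def check_positions(spec):
    """Positional placeholders need to cover 0..max without gaps"""
    positions = {
        a.position for _, op_args in spec for a in op_args
        if isinstance(a, Arg)
    }
    if positions != set(range(len(positions))):
        raise ValueError(f"Non-contiguous positions: {sorted(positions)}")


def call_meta_operation(spec, *args, **kwargs):
    """Replace placeholders with the given arguments, apply each operation"""
    check_positions(spec)
    prev = None
    for op_name, op_args in spec:
        resolved = []
        for a in op_args:
            if isinstance(a, Arg):
                resolved.append(args[a.position])
            elif isinstance(a, KWarg):
                resolved.append(kwargs[a.name])
            elif a is PREV:
                resolved.append(prev)
            else:
                resolved.append(a)
        prev = OPERATIONS[op_name](*resolved)
    return prev  # the last operation's result is the return value


my_equation = [  # [(a + b) * c - d] / e
    ("add", [KWarg("a"), KWarg("b")]),
    ("mul", [PREV, KWarg("c")]),
    ("sub", [PREV, KWarg("d")]),
    ("div", [PREV, KWarg("e")]),
]
result = call_meta_operation(my_equation, a=1, b=10, c=8, d=4, e=2)
print(result)  # [(1 + 10) * 8 - 4] / 2 == 42.0
```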

#### Return values#

Meta-operations always have one and only one return value: the last defined transformation.

Hint

To have “multiple” return values, e.g. to return an intermediate result, aggregate objects into a `dict` that can then be unpacked outside of the meta-operation. For an example, see Aggregate return values.

#### Using `select` within meta-operations#

The definitions inside `meta_operations` can have two possible formats:

```meta_operations:
  # A -- as list  ==> transformations only
  a_plus_b_cubed:
    - add: [!arg 0, !arg 1]
    - pow: [!dag_prev , 3]

  # B -- as dict  ==>  selections _and_ transformations
  select_and_square:
    select:
      data:
        path: !kwarg to_select
    transform:
      - squared: !dag_tag data

transform:
  - a_plus_b_cubed: [1, 2]
    tag: result1
  - select_and_square:
      to_select: path/to/some_data
    tag: result2
```

As can be seen above, the dict-based definition supports using the select interface. Importantly, this supports parametrization: simply use `!arg` or `!kwarg` inside the `select` specification, e.g. to make the path of the to-be-selected object an argument to the meta-operation.

#### Internal tags#

When defining simple meta-operations, passing the output of the previous operation through to the next one using `!dag_prev` usually suffices to connect operations. Such meta-operations are essentially linear DAGs.

However, to define non-linear meta-operations (or: general DAGs), it needs to be possible to use the result of any previously specified transformation. For that purpose, the `tag` entry and the `!dag_tag` YAML tag can be used, same as in the usual specification of references between transformations:

```meta_operations:
  my_meta_operation:  # [(x+1) * (x-1)] / (2*y)
    - add: [!arg 0, 1]
      tag: left
    - sub: [!arg 0, 1]
      tag: right
    - mul: [!dag_tag left, !dag_tag right]
      tag: top
    - mul: [!arg 1, 2]
    - div: [!dag_tag top , !dag_prev ]

transform:
  - my_meta_operation: [9, 2]  # [(9+1) * (9-1)] / (2*2) == 20
    tag: result
```

Internal tags are all `tag` definitions inside the `meta_operation` definition. These tags are solely accessible within the meta-operation and will not be available as results later on (only the return value will). In the above example, the `left`, `right`, and `top` tags are internal tags and they are referenced using the already-known `!dag_tag` YAML tag.

In contrast, the `tag` specified where the meta-operation is invoked (`result` in the example) is a regular tag. Effectively, it is attached to the last transformation of the meta-operation, here the `div` operation.

Note

In order to avoid silent errors and reduce unexpected behaviour, all internally defined tags need to be used within the meta-operation.
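As an analogy, internal tags behave much like local variables of a Python function: they connect intermediate results but are not visible to the caller. The function below is a hand-written counterpart of the `[(x+1) * (x-1)] / (2*y)` meta-operation, not something dantro generates.

```python
def my_meta_operation(x, y, /):
    # Internal tags correspond to local variables ...
    left = x + 1
    right = x - 1
    top = left * right
    # ... and the last transformation is the return value
    return top / (2 * y)


# The regular tag corresponds to the name the caller assigns the result to
result = my_meta_operation(9, 2)
print(result)  # [(9+1) * (9-1)] / (2*2) == 20.0
```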

#### Argument default values#

Meta-operation arguments `!arg` and `!kwarg` can also have default values. These are defined by passing a list of length 2 to the YAML tags (instead of a scalar number for positional arguments or a name for keyword arguments).

For instance, if you want an optional keyword argument `foo`, define it as:

```!kwarg [foo, my_default_value]
```

Equivalently for positional arguments:

```!arg [0, my_default_value]
```

Let’s look at an example where the `my_increment` meta-operation would increment by one per default or by some other value, if desired:

```meta_operations:
  my_increment:
    - add: [!arg 0, !arg [1, 1]]

transform:
  - my_increment: [0]
    tag: one
  - my_increment: !dag_prev
    tag: two
  - my_increment: [!dag_prev , 8]
    tag: ten
```

The above meta-operation is equivalent to the following Python function with one required positional-only argument and one optional positional-only argument:

```def my_increment(x, delta = 1, /):
    return x + delta

one = my_increment(0)
two = my_increment(one)
ten = my_increment(two, 8)
```

For a larger example that is using keyword arguments, see below.

Hint

Default values need not be scalar, they can be anything — as long as they do not contain any `Placeholder` objects like tags, references, or other argument definitions.

Warning

To clearly distinguish which arguments are optional and which are required, make sure that an `!arg` or `!kwarg` that has a default value specifies that default value at every occurrence within your meta-operation:

There should never be both a `!arg [0, 42]` and a `!arg 0` in the same meta-operation.

### Examples#

#### `prime_multiples`#

The following example performs operations on the arguments and then uses internal tags (`!dag_tag`) to connect their output to a result.

```meta_operations:
  prime_multiples:
    # Define powers of primes 2, 3, 5, and 7
    - pow: [2, !kwarg [base2, 0]]
      tag: b2
    - pow: [3, !kwarg [base3, 0]]
      tag: b3
    - pow: [5, !kwarg [base5, 0]]
      tag: b5
    - pow: [7, !kwarg [base7, 0]]
      tag: b7
    # Compute their product
    - np.: [prod, [!dag_tag b2, !dag_tag b3, !dag_tag b5, !dag_tag b7]]

transform:
  - prime_multiples:
      base2: 2
      base3: 1
      # base5: 0
      base7: 3
    tag: result
```

As can be seen in the following plot, the meta-operation is unpacked into individual transformation nodes:

Hint

The DAG visualization also shows which operation originated from which meta-operation (in parentheses below the operation name). Here, all originate from `prime_multiples`.
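For comparison, a hand-written Python counterpart of `prime_multiples` could look as follows. It is only an analogy: `math.prod` is used in place of the numpy product, and keyword-only arguments with defaults mirror the `!kwarg [name, 0]` definitions.

```python
import math


def prime_multiples(*, base2=0, base3=0, base5=0, base7=0):
    # Powers of the primes 2, 3, 5, and 7 (the internal tags b2 ... b7)
    b2 = 2**base2
    b3 = 3**base3
    b5 = 5**base5
    b7 = 7**base7
    # Their product is the return value
    return math.prod([b2, b3, b5, b7])


result = prime_multiples(base2=2, base3=1, base7=3)
print(result)  # 2**2 * 3**1 * 5**0 * 7**3 == 4116
```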

#### Aggregate return values#

In this example, a `dict` operation is used to return multiple results from a meta-operation.

```meta_operations:
  # Given some input data (as positional argument), compute a bunch of
  # statistical quantities. To return them, aggregate them into a dict.
  compute_stats:
    # Make sure it's an xarray object
    - xr.DataArray: !arg 0
      tag: data

    # Compute the statistics
    - .mean: !dag_tag data
      tag: mean
    - .std: !dag_tag data
      tag: std
    - .median: !dag_tag data
      tag: median
    - .min: !dag_tag data
      tag: min
    - .max: !dag_tag data
      tag: max
    - .quantile: [!dag_tag data, 0.25]
      tag: q25
    - .quantile: [!dag_tag data, 0.75]
      tag: q75

    # Aggregate into a dict as return value
    - dict:
        mean: !dag_tag mean
        std: !dag_tag std
        median: !dag_tag median
        min: !dag_tag min
        max: !dag_tag max
        q25: !dag_tag q25
        q75: !dag_tag q75

# Usage example: select some data and get some of the desired statistics
select:
  some_data: path/to/some_data
transform:
  - compute_stats: !dag_tag some_data
    tag: some_stats

  - getitem: [!dag_tag some_stats, mean]
    tag: some_mean
  - getitem: [!dag_tag some_stats, std]
    tag: some_std
  - getitem: [!dag_tag some_stats, median]
    tag: some_median
```

Note that by aggregating results into an object, the DAG will not be able to discern whether a branch of the `compute_stats` meta-operation is actually needed, thus potentially computing more results than required. In order to avoid computing more nodes than necessary, aggregated return values should be used sparingly; ideally, use them only to return an intermediate result.
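The same pattern, written as a plain Python function, makes the trade-off visible: every entry of the returned dict is computed, even though the caller only unpacks three of them. This sketch uses the standard library instead of xarray, omits the quantiles, and assumes the population standard deviation; the input values are made up.

```python
import statistics


def compute_stats(data, /):
    """Compute several statistics and aggregate them into one return value"""
    data = list(data)
    return dict(
        mean=statistics.fmean(data),
        std=statistics.pstdev(data),  # assumption: population std. dev.
        median=statistics.median(data),
        min=min(data),
        max=max(data),
    )


# All entries are computed ...
some_stats = compute_stats([1, 2, 3, 4])

# ... but only some of them are unpacked (the `getitem` operations)
some_mean = some_stats["mean"]
some_std = some_stats["std"]
some_median = some_stats["median"]
print(some_mean, some_std, some_median)
```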

This packing and unpacking can also be observed in the DAG plot:

#### `my_gauss`#

This example shows how to define a mathematical expression (also see: operation hooks) and expose its symbols as arguments of the meta-operation:

```meta_operations:
  # A meta-operation that defines a gaussian
  my_gauss:
    - expression: a * exp(- (x - mu)**2 / (2 * sigma**2))
      kwargs:
        symbols:
          x: !kwarg x
          a: !kwarg a
          mu: !kwarg mu
          sigma: !kwarg sigma

transform:
  # Compute the Gaussian for two values
  - my_gauss:
      a: 1.
      mu: 0.
      sigma: 1.
      x: 0.
    tag: default_gaussian
  - my_gauss:
      a: 1.
      mu: 23.
      sigma: 10.
      x: 23.
    tag: wide_gaussian_moved
```

For this case, it makes a lot of sense to use default values for meta-operation arguments, thus reducing the number of keyword arguments that need to be specified:

```meta_operations:
  # A meta-operation that defines a gaussian
  my_gauss:
    - expression: a * exp(- (x - mu)**2 / (2 * sigma**2))
      kwargs:
        symbols:
          x: !kwarg x
          a: !kwarg [a, 1.]
          mu: !kwarg [mu, 0.]
          sigma: !kwarg [sigma, 1.]

transform:
  # Compute the Gaussian for two values
  - my_gauss:
      x: 0.
    tag: default_gaussian
  - my_gauss:
      x: 23.
      a: 1.
      mu: 23.
      sigma: 10.
    tag: wide_gaussian_moved
```
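The version with default values corresponds to a Python function with one required and three optional keyword-only arguments. The sketch below is a hand-written equivalent using `math.exp`, not the symbolic `expression` operation.

```python
import math


def my_gauss(*, x, a=1.0, mu=0.0, sigma=1.0):
    """a * exp(-(x - mu)**2 / (2 * sigma**2))"""
    return a * math.exp(-((x - mu) ** 2) / (2 * sigma**2))


default_gaussian = my_gauss(x=0.0)
wide_gaussian_moved = my_gauss(x=23.0, a=1.0, mu=23.0, sigma=10.0)
print(default_gaussian, wide_gaussian_moved)  # both evaluated at the peak
```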

Hint

If you do not want to define default arguments, e.g. because you want to control the shared defaults via some YAML-based logic, you can also reduce the number of repeated arguments using YAML anchors and inheritance:

```transform:
  - my_gauss: &my_gauss_defaults    # <-- defines the defaults
      a: 1.
      mu: 0.
      sigma: 1.
      x: 0.
    tag: default_gaussian
  - my_gauss:
      <<: *my_gauss_defaults        # <-- re-use defaults ...
      a: 10.                        #     ... and update with new values
    tag: scaled_gaussian
  - my_gauss:
      <<: *my_gauss_defaults
      mu: -42.
    tag: moved_gaussian
```

### Remarks & Caveats#

Note the following remarks regarding the definition and use of meta-operations:

• Inside meta-operations, no outside tags except the “special” tags (`dag`, `dm`, `select_base`) can be used. Further inputs should be handled by adding arguments to the meta-operation as described above.

• When using the `select` syntax in the definition of a meta-operation and aiming to define an argument, note that the long syntax needs to be used:

```select:
  # Correct
  some_data:
    path: !kwarg some_data_path

  # WRONG! Will not work.
  other_data: !kwarg other_data_path
```
• When defining a meta-operation and using an operation that makes use of an operation hook, the tags created by the hook need to be explicitly exposed as arguments, otherwise there will be an `Unused tags ...` error. To expose them, there are two ways:

• Use them internally by adding a `define`, `dict`, or `list` operation prior to the operation that uses the hook; then explicitly specify them as arguments there.

• In the case of the `expression` operation hook, use the `kwargs.symbols` entry to directly define them as arguments, as done in the `my_gauss` example above.

• A meta-operation always adds a so-called “result node”, which uses the `pass` operation to make the result of the meta-operation available. When using a meta-operation, the arguments `tag` and `file_cache` (see below) as well as any error handling arguments are added only to this result node. For all other transformation nodes of a meta-operation, the following holds:

• They may have only internal tags attached

• They may define their own `file_cache` behavior; if they do not, the default values for file caching are used.

• They are free to define their own error handling behavior.

## Error Handling#

Operations are not always guaranteed to succeed. To define more robust operations, some form of error handling is required, akin to `try-except` blocks in Python.

In the data transformation framework, the `allow_failure` option handles failing data operations and allows specifying a `fallback` value that is used as the result in case the operation fails. Let’s have a look:

```  - float: "inf"
  - div: [1, 0]               # 1 / 0  -->  raises ZeroDivisionError
    allow_failure: true
    fallback: !dag_prev
    tag: result
```

Here, the `ZeroDivisionError` is avoided and, instead, the value of the previous node (which defines a float infinity value) is used. Consequently, the `result` will be the Python floating-point `inf`.
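Following the `try-except` analogy, the behaviour can be sketched in a few lines of Python. The `apply_operation` helper is hypothetical and only mimics the `allow_failure` and `fallback` semantics described here.

```python
import operator


def apply_operation(func, *args, allow_failure=False, fallback=None):
    """Apply an operation; on failure, optionally return a fallback instead"""
    try:
        return func(*args)
    except Exception:
        if not allow_failure:
            raise
        return fallback


prev = float("inf")  # result of the previous node
result = apply_operation(
    operator.truediv, 1, 0, allow_failure=True, fallback=prev
)
print(result)  # inf
```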

Note

The `allow_failure` argument also accepts a few string-like values which control the verbosity of the output in case of failure:

• `log` does the same as `True`: print a prominent log message that informs about the failed operation and the use of the fallback.

• `warn`: emits a Python warning

• `silent`: suppresses the message altogether

Example:

```  - float: "inf"
  - div: [1, 0]
    allow_failure: silent     # can also be: True, log, warn, False
    fallback: !dag_prev
```

For debugging, make sure to not use `silent`.
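How such a verbosity switch might be dispatched is sketched below. This is a guess at the general shape, not dantro's code: the `handle_failure` helper, the logger name, the message text, and the choice of `RuntimeWarning` are all assumptions.

```python
import logging
import warnings

log = logging.getLogger("dag_sketch")  # hypothetical logger name


def handle_failure(exc, *, allow_failure, fallback):
    """Decide what to do with a failed operation, depending on the mode"""
    if allow_failure is False:
        raise exc

    msg = f"Operation failed with {type(exc).__name__}; using fallback."
    if allow_failure is True or allow_failure == "log":
        log.warning(msg)
    elif allow_failure == "warn":
        warnings.warn(msg, RuntimeWarning)
    elif allow_failure == "silent":
        pass
    else:
        raise ValueError(f"Invalid allow_failure value: {allow_failure!r}")
    return fallback


try:
    1 / 0
except ZeroDivisionError as exc:
    result = handle_failure(exc, allow_failure="silent", fallback=float("inf"))
```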

Hint

The `fallback` argument accepts not only scalars, but also sequences or mappings, which in turn may contain `!dag_tag` references.

### Upstream errors#

Sometimes, an error only becomes relevant in a later operation and it makes sense to defer error handling to that point. The analogy to Python exception handling would be to handle the error not directly where it occurs but in an outside scope.

This is also possible within the error handling framework, because `allow_failure` pertains to both the computation of the specified operation and the resolution of its arguments. As the resolution of arguments triggers the computation of dependent nodes (and their dependencies, and so forth), an upstream error may also be caught in a downstream node:

```  # Example input: assume that this may also be the output from previous
  # operations which are used to calculate something else ...
  - define: -1.23
    tag: some_value
  - define: +1
    tag: some_other_value

  # Perform some potentially problematic operations with these ...
  - import_and_call: [math, log10, !dag_tag some_value]   # --> ValueError
    tag: log10_value

  - import: [np, pi]
    tag: pi
  - sub: [!dag_tag some_other_value, 1.]
  - div: [!dag_tag pi, !dag_prev ]                        # --> ZeroDivisionError
    tag: pi_over_some_other_value

  # ... leading to the result
  - add: [!dag_tag log10_value, !dag_tag pi_over_some_other_value]
    allow_failure: true
    fallback: 42
    tag: my_result
```

In this example, the nodes tagged `log10_value` and `pi_over_some_other_value` are both problematic but do not specify any error handling. However, we may only be interested in `my_result`, which depends on those two transformation results. Let’s say, we specified `compute_only: [my_result]`. What would happen in such a case?

• The transformation tagged `my_result` is looked up in the DAG.

• The transformation’s arguments are recursively resolved, triggering lookup of the dependencies `log10_value` and `pi_over_some_other_value`.

• The referenced transformations would in turn look up their arguments and finally lead to the application of the problematic operations (`div` and `math.log10`), which will fail for the arguments in this example.

• An error is raised during those operations.

• The error propagates back to the `my_result` transformation.

• With `allow_failure: true`, the error is caught and the fallback value is used instead.

Warning

The above example only works with `compute_only: [my_result]`. If the problematic tags were to be computed directly, e.g. via `compute_only: all`, they would raise an error because they do not specify any error handling themselves.
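The propagation of upstream errors can be reproduced with a small lazy-evaluation sketch. The `Node` class is hypothetical and much simpler than dantro's `Transformation`; it only illustrates that resolving arguments happens inside the scope covered by `allow_failure`.

```python
import math
import operator


class Node:
    """Hypothetical lazy node: computes its arguments only when needed"""

    def __init__(self, func, *args, allow_failure=False, fallback=None):
        self.func = func
        self.args = args
        self.allow_failure = allow_failure
        self.fallback = fallback

    def compute(self):
        try:
            # Resolving arguments triggers computation of upstream nodes,
            # so their errors surface here as well
            resolved = [
                a.compute() if isinstance(a, Node) else a for a in self.args
            ]
            return self.func(*resolved)
        except Exception:
            if not self.allow_failure:
                raise
            return self.fallback


some_value = Node(lambda: -1.23)
some_other_value = Node(lambda: 1)

log10_value = Node(math.log10, some_value)  # fails with ValueError
difference = Node(operator.sub, some_other_value, 1.0)
pi_over_some_other_value = Node(
    operator.truediv, math.pi, difference
)  # fails with ZeroDivisionError

my_result = Node(
    operator.add, log10_value, pi_over_some_other_value,
    allow_failure=True, fallback=42,
)
print(my_result.compute())  # 42
```

Calling `log10_value.compute()` directly raises, because that node does not define any error handling of its own.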

Note

This example is purely for illustration! Typically, one would define these operations using numpy and they would not raise exceptions but issue a `RuntimeWarning` and use `nan` as result.

### Error handling within `select`#

The `select` operation may also specify a fallback. This fallback will only be applied to the `getitem` operation which is used to look up the `path` from the specified selection base:

```select:
  some_data: path/to/some_data
  mean_data:
    path: some/invalid/path       # The underlying `getitem` will fail ...
    allow_failure: true           # ... but is allowed to.
    fallback: [[1, 2, 3]]         # Instead, this fallback value is used.
    transform:
      - np.mean                   # ... which still works for a mean

transform:
  - expression: (some_data + mean_data + 1) ** 4
    tag: my_result
```

Hint

The `transform` elements can of course again specify their own fallbacks.

#### Limitations#

There are some limitations to using `allow_failure` within `select`. Mainly, specifying a fallback may be difficult in practice because other tags may not yet be available at the time when the DAG is populated with the `select` arguments.

The tags specified by `select` are added in alphabetical order and before any transformations from `transform` are added to the DAG. Consequently, lookups within one `select` field are only possible from within `select` and for fields that appear earlier in that alphabetical order. (See this issue for a potential improvement to this behavior.)

Using a tagged reference in the `fallback` works in the following example because `'_some_fallback_data' < 'mean_data'`:

```select:
  _some_fallback_data: path/to/some_data
  mean_data:
    path: some/invalid/path
    allow_failure: true
    fallback: !dag_tag _some_fallback_data
    transform:
      - np.mean

transform:
  - expression: (mean_data + 1) ** (-0.5)
    tag: my_result
```
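Whether a given `select` field can reference another one can be checked quickly in Python. The sketch assumes that the alphabetical order is Python's default string ordering, as the comparison `'_some_fallback_data' < 'mean_data'` suggests; the `can_reference` helper is hypothetical.

```python
select_fields = ["mean_data", "_some_fallback_data"]

# Fields are added to the DAG in sorted order
insertion_order = sorted(select_fields)


def can_reference(field, other):
    """A field can only reference fields that were added before it"""
    return insertion_order.index(other) < insertion_order.index(field)


print(insertion_order)
print(can_reference("mean_data", "_some_fallback_data"))  # True
print(can_reference("_some_fallback_data", "mean_data"))  # False
```

With this ordering, an underscore sorts before all lowercase letters but after uppercase letters, which is worth keeping in mind when choosing names for fallback fields.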

Hint

We advise against building overly complex fallback structures within `select`, e.g. using tagged fallbacks which in turn have tagged fallbacks and so forth. While possible, this may easily become tedious to build or maintain.

If you require more advanced error handling for certain operations, consider wrapping them into your own data operation. See Resolving and applying operations for more information.

## The File Cache#

Caching of already computed results is a powerful feature of the `TransformationDAG` class. The idea is that if some computationally expensive transformation already took place previously, it should not be necessary to compute it again.

### Background#

To understand the file cache, it’s first necessary to understand the internal handling of transformations.

Within the DAG, each transformation is fully identified by its hash. If the hashes of two transformations are the same it means the operation is the same and all arguments are the same.

All `Transformation` objects are stored in an `objects` database, which maps a hash to a Python object. In effect, there is one and only one `Transformation` object associated with a certain hash.

Say, a DAG contains two nodes, N1 and N2, with the same hash. Then the object database contains a single transformation T, which is used in place of both nodes N1 and N2. Thus, if the result of one of the nodes is computed, the other should already know the result and not need to re-compute it.

That is what is called the memory cache: once a result is computed, it stays in memory, such that it need not be recomputed again. This is useful not only in the above situation but also when doing DAG traversal during computation.
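The interplay of hashes, the objects database, and the memory cache can be sketched as follows. This is a simplified model, not dantro's implementation: the `Transformation` class here is a toy, and hashing the `repr` of operation and arguments with MD5 is an assumption made for the sketch.

```python
import hashlib


class Transformation:
    """Toy transformation: an operation name plus arguments"""

    def __init__(self, operation, args):
        self.operation = operation
        self.args = tuple(args)
        self._computed = False
        self._cache = None
        self.num_computations = 0

    @property
    def hashstr(self):
        # Depends only on the operation and its arguments
        content = repr((self.operation, self.args))
        return hashlib.md5(content.encode()).hexdigest()

    def compute(self, operations):
        if not self._computed:  # memory cache miss
            self._cache = operations[self.operation](*self.args)
            self._computed = True
            self.num_computations += 1
        return self._cache


objects = {}  # the objects database: hash -> Transformation


def add_node(operation, args):
    trf = Transformation(operation, args)
    # Only the first object with a given hash is kept; duplicates are dropped
    return objects.setdefault(trf.hashstr, trf)


operations = {"add": lambda a, b: a + b}

n1 = add_node("add", [1, 2])
n2 = add_node("add", [1, 2])  # same hash, so the same object as n1

print(n1 is n2, len(objects))
print(n1.compute(operations), n2.compute(operations), n1.num_computations)
```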

The file cache is not much different from the memory cache: it aims to make computation results persist in order to save computational resources. With the file cache, the results can persist over multiple invocations of the transformation framework.

### Configuration#

#### Cache directory#

Cache files need to be written in some place. This can be specified via the `cache_dir` argument during the initialization of a `TransformationDAG`; see there for details.

By default, the cache directory is called `.cache` and is located inside the data directory associated with the DAG’s DataManager. It is created once it is needed.

#### Default file cache arguments#

File cache behavior can be configured separately for each `Transformation`, as can be seen from the full syntax specification above.

However, it’s often useful to have default values defined that all transformations share. To do so, pass a dict to the `file_cache_defaults` argument. In the simplest form, it looks like this:

```file_cache_defaults:
  read: true
  write: true
transform:
  - # ...
```

This enables both reading from the cache directory and writing to it. When passing booleans to `read` and `write`, the default behavior is used. To configure the behavior more specifically, again see the full syntax specification above.

When specifying additional `file_cache` arguments within `transform`, the values specified there recursively update the ones given by `file_cache_defaults`.
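A recursive update of this kind can be sketched as below. Both helper functions are written for this illustration only; expanding a boolean into a dict with an `enabled` key is an assumption about how the shorthand could be interpreted, and the `always` key merely serves as an example of an additional option.

```python
import copy


def normalize(opts):
    """Expand the boolean shorthand for `read` and `write` into dicts"""
    out = copy.deepcopy(opts)
    for key in ("read", "write"):
        if isinstance(out.get(key), bool):
            out[key] = dict(enabled=out[key])
    return out


def recursive_update(base, update):
    """Return a copy of `base`, recursively updated with `update`"""
    result = copy.deepcopy(base)
    for key, val in update.items():
        if isinstance(val, dict) and isinstance(result.get(key), dict):
            result[key] = recursive_update(result[key], val)
        else:
            result[key] = copy.deepcopy(val)
    return result


file_cache_defaults = normalize(dict(read=True, write=True))
node_file_cache = normalize(dict(write=dict(always=True)))

effective = recursive_update(file_cache_defaults, node_file_cache)
print(effective)
```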

Note

The `getitem` operations defined via the `select` interface always have caching disabled; it makes no sense to cache objects that have been looked up directly from the data tree.

Warning

The file cache arguments are not taken into account for computation of the transformations’ hash. Thus, if there are two transformations with the same hash, only the additional file cache arguments given to the first one are taken into account; the second ones have no effect because the second transformation object is discarded altogether.

Warning

If it is desired to have two transformations with different file cache options, the `salt` can be used to perturb the hash of one of them and thus force the use of the additional file cache arguments.

#### Reading from the file cache#

Generally, the best computation is the one you don’t need to make. If there is no result in memory and reading from cache is enabled, the cache directory is searched for a file that has as its basename the hash of the transformation that is to be computed.

If such a file exists, the DataManager is used to load the data into the data tree and set the memory cache. (Note that this is Python, i.e. the memory cache is not a copy but a reference to the object in the data tree.)

By default, it is not attempted to read from the cache directory. See above on how to enable it.
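The lookup step can be pictured as a search for a file whose basename equals the hash. The following sketch uses `pathlib` for that; the `find_cache_file` helper, the shortened hash strings, and the `.pkl` extension are made up for the example, and loading the file is left out.

```python
import tempfile
from pathlib import Path


def find_cache_file(cache_dir, hashstr):
    """Return the cache file for the given hash, or None if there is none"""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return None

    candidates = sorted(cache_dir.glob(f"{hashstr}.*"))
    if len(candidates) > 1:
        # Ambiguous: more than one file shares this basename
        raise ValueError(f"Multiple cache files for hash {hashstr}!")
    return candidates[0] if candidates else None


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / ".cache"
    cache_dir.mkdir()
    (cache_dir / "0123abcd.pkl").write_bytes(b"")

    hit = find_cache_file(cache_dir, "0123abcd")
    miss = find_cache_file(cache_dir, "ffff0000")
    hit_name = hit.name

print(hit_name, miss)
```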

Note

When desiring to use the caching feature of the transformation framework, the employed `DataManager` needs to be able to load numerical data. If you are not already using the `AllAvailableLoadersMixin`, consider adding `NumpyLoaderMixin`, `XarrayLoaderMixin`, and `PickleLoaderMixin` to your `DataManager` specialization.

Hint

Sometimes it can be desired to always read from the file cache, e.g. to make use of the `load_options` argument. In that case, set the following arguments to make sure that a cache file will be written after a computation.

```file_cache:
  read:
    enabled: true
    always: true
    load_options:
      chunks: true
  write: true
```

Note that the computed result may still remain in the memory cache. See `Transformation` on how to not keep it in memory.

#### Writing to the file cache#

After a computation result was either looked up from the cache or computed, it can be stored in the file cache. By default, writing to the cache is not enabled, either. See above on how to enable it.

Several options control whether a transformation’s result is actually written to a cache file. For example, it might make sense to store only results that took a very long time to compute or that are very large.
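Such a decision could be sketched as a small predicate. All parameter names below are illustrative and the way the criteria are combined (every specified threshold needs to be met) is an assumption; refer to the full syntax specification for the options that are actually available.

```python
def should_write(
    *,
    enabled,
    compute_time,
    size,
    always=False,
    min_compute_time=None,
    min_size=None,
):
    """Decide whether a result should be written to the file cache"""
    if not enabled:
        return False
    if always:
        return True
    if min_compute_time is not None and compute_time < min_compute_time:
        return False  # was cheap to compute, not worth storing
    if min_size is not None and size < min_size:
        return False  # too small to be worth storing
    return True


# A result that took 12 seconds, with a 5 second threshold: write it
print(should_write(enabled=True, compute_time=12.0, size=10,
                   min_compute_time=5.0))
```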

Once it is decided that a result is to be written to a cache file, the corresponding storage function is invoked. It creates the cache directory, if it does not already exist, and then attempts to save the result object using a set of different storage functions.

There are specific storage functions for numerical data: numpy arrays are stored via the `numpy.save` function, which is also used to store `NumpyDataContainer` objects. Another specific storage function takes care of `xarray.DataArray` and `XrDataContainer` objects.

If there is no specific storage function available, it is attempted to pickle the object.

Note

It is not currently possible to store `BaseDataGroup`-derived objects in the file cache.

### Remarks#

• The structure of the DAG – a Merkle tree, or: hash tree – ensures that each node’s hash depends on all parent nodes’ hashes. Thus, all downstream hashes will change if some early operation’s arguments are changed.

• The transformation framework cannot distinguish between arguments that are relevant for the result and those that are not; all arguments are taken into account in computing the hash.

• It might not always make sense to read from or write to the cache, depending on how long it took to compute, how much data is to be stored and loaded and how long that takes.

• Dividing up large transformations into many small transformations will increase the likelihood of cache hits; however, this also increases the memory footprint of the DAG by potentially requiring more memory for intermediate objects and more read/write operations to the file cache.

• There may never be more than one file in the cache directory that has the same basename (i.e.: hash) as another file. Such situations need to be resolved manually by deleting all but one of the corresponding files.

• There is no harm in just deleting the cache directory, e.g. when it gets too large.
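The first remark, the hash-tree property, can be demonstrated with a few lines of Python. The `node_hash` helper and the use of SHA-256 are assumptions for this sketch; the point is only that a node's hash includes the hashes of the nodes it depends on, so a change early in the chain alters every hash after it.

```python
import hashlib


def node_hash(operation, args):
    """Hash of a node; arguments that are nodes enter via their own hash"""
    parts = [operation] + [str(a) for a in args]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()


def build_chain(x):
    a = node_hash("add", [x, 1])
    b = node_hash("mul", [a, 2])  # depends on the hash of a
    c = node_hash("sub", [b, 3])  # depends on the hash of b
    return a, b, c


hashes_1 = build_chain(1)
hashes_2 = build_chain(2)  # only the earliest argument differs
hashes_1_again = build_chain(1)

# All hashes along the chain differ, so no cache file would be re-used
print(all(h1 != h2 for h1, h2 in zip(hashes_1, hashes_2)))
# Identical definitions lead to identical hashes, enabling cache hits
print(hashes_1 == hashes_1_again)
```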