Data Transformation Framework#

The uniform structure of the dantro data tree is an ideal starting point for applying transformations to data in a general way. This page describes dantro’s data transformation framework, revolving around the TransformationDAG class. It is sometimes also referred to as the DAG framework or the data selection and transformation framework, and it finds application in the plotting framework.

This page is an introduction to the DAG framework and a description of its inner workings. To learn more about its practical usage, make sure to look at the Data Transformation Examples.

Overview#

The purpose of the transformation framework is to generally apply mathematical operations to data stored in a dantro data tree. Specifically, it makes it possible to define transformations without touching actual Python code. To that end, a meta language is defined in which almost arbitrary transformations can be expressed.

In dantro terminology, a transformation is defined as a set consisting of an operation and some arguments. Say, for example, we want to perform a simple addition of the two quantities 1 and 2; we are used to writing 1 + 2. To define a transformation using the meta language, this translates to a set consisting of the add operation and two (ordered) arguments: 1 and 2.

Now, typically transformations don’t come on their own and are not nearly as trivial as the above. You might desire to compute a + b, where both a and b are results of previous transformations.

This can be represented as a directed acyclic graph, or short: DAG. For the example above, the graph is rather small:

  a:(…)   b:(…)
    ^      ^
     \    /
      \  /
(add, a, b)

The nodes in this graph represent transformations. These nodes can have labels, e.g. a and b, which are called references or tags in dantro terminology. As illustrated by the example above, the tags can be used in place of arguments to denote that the result of a previous transformation (with the corresponding label) should be used.

The directed edges in the graph represent dependencies. The acyclic property is required so that the computation of a transformation result does not end in an infinite loop due to a circular dependency.

The Transformation and TransformationDAG dantro classes implement exactly this structure, making the following features available:

  • Easy and generic access to data stored in an associated DataManager

  • Definition of arbitrary DAGs via dictionary-based configurations

  • Syntax optimized to make specification via YAML easy

  • Shorthand notations available

  • New and custom operations can be registered

  • There are no restrictions on the signature of operations

  • Caching of transformations is possible, avoiding re-calculation of computationally expensive transformations

  • Transformations are uniquely representable by a hash

The Transformation Syntax#

This section will guide you through the syntax used to define transformations. It will explain the basic elements and inner workings of the mini-language created for the purpose of the DAG.

Note

This explanation goes into quite some detail, as it is important to understand the underlying structures of the framework. If you would like to jump ahead to see what awaits you, have a look at the Minimal Syntax.

The TransformationDAG#

The structure a user (you!) mainly interacts with is the TransformationDAG class. It takes care of building the DAG by creating Transformation objects according to the specification you provide. In the following, all YAML examples represent the arguments that are passed to the TransformationDAG during initialization.
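
For orientation, here is a minimal sketch of how such a specification reaches the class. It assumes an existing DataManager instance dm and the YAML already parsed into a dict dag_cfg (using dantro's YAML tools, which register the custom !dag_* tags); the exact signature may differ between dantro versions.

from dantro.dag import TransformationDAG

# dag_cfg holds entries like `transform` or `select`, as in the
# YAML examples below; `dm` is an already initialized DataManager.
tdag = TransformationDAG(dm=dm, **dag_cfg)
results = tdag.compute()   # dict mapping tags to computed results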

Basics#

Ok, let’s start with the basics: How can transformations be defined? For the sake of simplicity, let’s only look at transformations that are fully independent of other transformations.

Explicit syntax#

The explicit syntax to define a single Transformation via the TransformationDAG looks like this:

transform:
  - operation: add
    args: [1, 2]
    kwargs: {}

The transform argument is the main argument to specify transformations. It accepts a sequence of mappings. Each entry of the sequence contains all arguments that are needed to create a single Transformation.

As you see, the syntax is very close to the above definition of what a dantro transformation contains.

Note

The args and kwargs arguments can also be left out, if no positional or keyword arguments are to be passed, respectively. This is equivalent to setting them to ~ or empty lists / dicts.

Specifying multiple transformations#

To specify multiple transformations, simply add more entries to the transform sequence:

transform:
  - operation: add
    args: [3, 4]
  - operation: sub
    args: [8, 2]
  - operation: mul
    args: [6, 7]

Assigning tags#

Every node of the DAG has a unique identifier in the form of a hash: a 32-character hexadecimal string. While the hash can be used to identify a transformation, the easiest way to refer to it is by using a so-called tag.

Tags are simply plain text pointers to a specific hash, which in turn denotes a specific transformation. To add a tag to a transformation, use the tag key.

transform:
  - operation: add
    args: [3, 4]
    tag: some_addition
  - operation: sub
    args: [8, 2]
    tag: some_subtraction
  - operation: mul
    args: [6, 7]
    tag: the_answer

Note

No two transformations can have the same tag.

Advanced Referencing#

In the examples above, all transformations were independent of each other. Having completely independent and disconnected nodes, of course, defeats the purpose of having a DAG structure.

Now let’s look at proper, non-trivial DAGs, where individual transformations use the results of other transformations.

Referencing other Transformations#

Other transformations can be referenced in three ways, each with a corresponding Python class and an associated YAML tag:

  • DAGReference and !dag_ref: This is the most basic and most explicit reference, using the transformations’ hash to identify a reference.

  • DAGTag and !dag_tag: References by tag are the preferred references. They use the plain text name specified via the tag key.

  • DAGNode and !dag_node: Uses the ID of the node within the DAG. Mostly for internal usage!

Note

When the DAG is built, all references are brought into the most explicit format: DAGReference objects. Thus, internally, the transformation framework works only with hash references.

The best way to refer to other transformations is by tag: there is no ambiguity, it is easy to define, and it allows you to easily build a DAG tree structure. A simple example with three nodes would be the following:

transform:
  - operation: add
    args: [3, 4]
    tag: some_addition
  - operation: sub
    args: [8, 2]
    tag: some_subtraction
  - operation: mul
    args:
      - !dag_tag some_addition
      - !dag_tag some_subtraction
    tag: the_answer

Which is equivalent to:

some_addition = 3 + 4
some_subtraction = 8 - 2
the_answer = some_addition * some_subtraction

References can appear within the positional and the keyword arguments of a transformation. As you see, they behave quite a bit like variables in programming languages, with the differences that you cannot reassign a tag and must not form circular dependencies.

Using the result of the previous transformation#

When chaining multiple transformations to each other and not being interested in the intermediate results, it is tedious to always define tags:

transform:
  - operation: mul
    args: [1, 2]
    tag: f2
  - operation: mul
    args: [!dag_tag f2, 3]
    tag: f3
  - operation: mul
    args: [!dag_tag f3, 4]
    tag: f4
  - operation: mul
    args: [!dag_tag f4, 5]
    tag: f5

Let’s say we’re only interested in f5. The only thing we want is that the result of each transformation is carried on to the next one. The with_previous_result feature can help in this case: it adds a reference to the previous node as the first positional argument, such that it is no longer necessary to define intermediate tags.

transform:
  - operation: mul
    args: [1, 2]
  - operation: mul
    args: [3]
    with_previous_result: true
  - operation: mul
    args: [4]
    with_previous_result: true
  - operation: mul
    args: [5]
    with_previous_result: true
    tag: f5

Note that the args, in that case, specify one fewer positional argument.
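
Written out in plain Python, the chained specification above computes:

f5 = (((1 * 2) * 3) * 4) * 5   # == 120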

Warning

Using !dag_node in your specifications is not recommended. Use it only if you really know what you’re doing.

In case the result of the previous transformation should not be used in place of the first positional argument but somewhere else, there is the !dag_prev YAML tag, which creates a node reference to the previous node:

transform:
  - operation: define
    args: [10]
  - operation: sub
    args: [0, !dag_prev ]
  - operation: div
    args: [1, !dag_prev ]
  - operation: power
    args: [10, !dag_prev ]
    tag: my_result
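
Written out in plain Python, this chain evaluates to:

my_result = 10 ** (1 / (0 - 10))   # == 10 ** (-0.1), roughly 0.794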

Note

Notice the space behind !dag_prev: without it, the YAML parser might complain about a character directly following the tag, as in …, !dag_prev].

Computing Results#

To compute the results of the DAG, invoke the TransformationDAG‘s compute() method.

It can be called without any arguments, in which case the result of all tagged transformations will be computed and returned as a dict. If only the result of a subset of tags should be computed, they can also be specified.
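
For example, assuming a TransformationDAG instance tdag that was set up with the specifications from above:

# Compute all (non-private) tagged results ...
all_results = tdag.compute()

# ... or only a specific subset of tags:
some_results = tdag.compute(compute_only=["the_answer"])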

Computing results works as follows:

  1. Each tagged Transformation is visited and its own compute() method is invoked

  2. A cache lookup occurs, attempting to read the result from a memory or file cache.

  3. The transformations resolve potential references in their arguments: If a DAGReference is encountered, the corresponding Transformation is resolved and that transformation’s compute() method is invoked. This traverses all the way up the DAG until reaching the root nodes which contain only basic data types (that need no computation).

  4. Having resolved all references into results, the arguments are assembled, the operation callable is resolved, and invoked by passing the arguments.

  5. The result is kept in a memory cache. It can additionally be stored in a file cache to persist to later invocations.

  6. The result object is returned.

Note

Only nodes that are tagged can be part of the results. Intermediate results still need to be computed, but they will not be part of the results dict. If you want an intermediate result to be available there, add a tag to it.

This also means: If there are parts of the DAG that are not tagged at all, they will not be reached by any recursive argument lookup.

Hint

Use the compute_only argument of compute() to specify which tags are to be computed. If not given, all tags will be computed, unless they start with a . or _ (these are so-called “private” tags).

To compute private tags directly, include them in compute_only.

You can also force computation of a node, even if untagged, by adding force_compute to the transformation, see below.

Hint

To learn which parts of the computation require the most time, e.g. in order to evaluate whether to cache the result, inspecting the DAG profile statistics can be useful. The TransformationDAG‘s verbosity attribute controls how extensively statistics are written to the log output. By default (verbosity 1), only per-node statistics are emitted. For levels >= 2, per-operation statistics are shown alongside.

Resolving and applying operations#

Let’s have a brief look into how the operation argument is actually resolved and how the operation is then applied.

This feature is not specific to the DAG, but the DAG uses the data_ops module, which implements a database of available operations and the apply_operation() function to apply an operation. Basically, this is a thin wrapper around a function lookup and its invocation.

For a full list of available data operations, see here.

Hint

You can also use the import operation to retrieve a callable (or any other object) via a Python import and then use the call operation to invoke it. These two operations are combined in the import_and_call operation:

transform:
  - operation: import_and_call
    args: [numpy.random, randint]
    kwargs:
      low: 0
      high: 10
      size: [2, 3, 4]
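
This is roughly equivalent to the following Python code:

from numpy.random import randint

result = randint(low=0, high=10, size=(2, 3, 4))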

To specifically register additional operations, use the register_operation() function. This should only be done for operations that are not easily usable via the import and call operations.
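
As an illustration, a custom operation could be registered roughly like this (a sketch; the operation name is hypothetical, and the exact import path and signature of register_operation() may differ between dantro versions):

from dantro.data_ops import register_operation

def normalize(data):
    # Scales the data such that its maximum value becomes 1
    return data / data.max()

register_operation(name="normalize", func=normalize)

# Afterwards, `operation: normalize` can be used in transformations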

Forcing computation of individual nodes#

Sometimes, it’s useful to force the computation of an individual node. To that end, simply set the force_compute option for a transformation:

transform:
  - operation: div
    args: [1, 0]
    force_compute: true

In this example, the node will always be computed, obviously leading to a ZeroDivisionError.

A few remarks:

  • Force-computed tags are computed before the tags specified in compute_only.

  • Typical use case is during debugging, where you want to make sure that an operation really is carried out.

  • Unlike tagged nodes, their results are not available in the results dict. Even if the node is tagged, it will only appear in the results dict if it is part of compute_only.

Selecting from the DataManager#

The above examples are trivial in that they do not use any actual data but define some dummy values. This section shows how data can be selected from the DataManager that is associated with the TransformationDAG.

The process of selecting data is no different from other transformations. It makes use of the getitem operation that would also be used for regular item access, and it uses the fact that the data manager is available via the dm tag.

Note

The DataManager is also identified by a hash, which is computed from its name and its associated data directory path. Thus, managers for different data directories have different hashes.

The select interface#

As selecting data from the DataManager is a common use case, the TransformationDAG supports the select argument besides the transform argument.

The select argument expects a mapping of tags to either strings (the path within the data tree) or further mappings (where more configurations are possible):

select:
  some_data: path/to/some_data
  more_data:
    path: path/to/more_data
    # ... potentially more kwargs
transform: ~

The results dict will then have two tags, some_data and more_data, each of which is the selected object from the data tree.
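
In plain Python, this selection is equivalent to item access on the data manager:

some_data = dm["path/to/some_data"]
more_data = dm["path/to/more_data"]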

Note

The above example is translated into the following basic transformation specifications:

transform:
  - operation: getitem
    args: [!dag_tag dm, path/to/more_data]
    tag: more_data
  - operation: getitem
    args: [!dag_tag dm, path/to/some_data]
    tag: some_data

Note that the order of operations is sorted alphabetically by the tag specified under the select key.

Directly transforming selected data#

Often, it is desired to apply some sequential transformations to selected data before working with it. As part of the select interface, this is also possible:

select:
  square_increment:
    path: path/to/some_data
    with_previous_result: true
    transform:
      - operation: squared
      - operation: increment

  some_sum:
    path: path/to/more_data
    transform:
      - operation: getattr
        args: [!dag_prev , data]
      - operation: sub
        args: [0, !dag_prev ]
      - operation: .sum
        args: [!dag_prev ]
transform:
  - operation: add
    args: [!dag_tag square_increment, !dag_tag some_sum ]
    tag: my_result

Notice the difference between square_increment, where the result is carried over, and some_sum, where the reference has to be specified explicitly. As visible there, within the select interface, the with_previous_result option can also be specified such that it applies to the whole sequence of transformations that is based on some selection from the data manager.
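
Expressed in plain Python, the above roughly computes the following (assuming that the squared and increment operations behave like x ** 2 and x + 1, respectively):

square_increment = dm["path/to/some_data"] ** 2 + 1
some_sum = (0 - dm["path/to/more_data"].data).sum()
my_result = square_increment + some_sum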

Note

The parser expands this syntax into a sequence of basic transformations.

It does so before any transformations from the transform argument are evaluated. Thus, tags defined via transform are not available from within select!

Changing the selection base#

By default, selection happens from the associated DataManager, tagged dm. This can be controlled via the select_base property, which can be set as an argument to __init__ and also afterwards via the property. The property expects either a DAGReference object or a valid tag string.

If set, all following select arguments are using that reference as the basis, leading to getitem operations on that object rather than on the data manager.

As the select arguments are evaluated before any transform operations, only the default tags are available during initialization. To widen the possibilities, the TransformationDAG allows the base_transform argument during initialization; this is just a sequence of transform specifications, which are applied before the select argument is evaluated, thus allowing you to select some object, tag it, and use that tag for the select_base argument.

Note

The select_path_prefix argument offers similar functionality, but merely prepends a path to the argument. If possible, the select_base functionality should be preferred over select_path_prefix as it reduces lookups and cooperates more nicely with the file caching features.

Background Information

Internally, when the select specification is evaluated, it is set to select against a special tag select_base; by default, this is the same as the dm special tag.

Effectively, the select feature always selects starting from the object the select_base property points to at the time the nodes are added to the DAG. In other words, if the select_base is changed after the nodes were added, this will not have any effect.

For meta-operations this means that the base of selection is not relevant at definition of the meta-operations; the base gets evaluated when the meta-operation is used.

The define interface#

So far, we have seen two ways to add transformation nodes to the DAG: via transform or via select. These are based either on directly adding the nodes, giving full control, or adding transformations based on a selection of data.

The define interface is a combination of these two approaches: same as select, it revolves around the final tag that is meant to be attached to the definition, but it does not require a data selection like select does.

Let’s look at an example that combines all these ways of adding transformations:

define:
  exponent: 4               # directly define some object
  days_to_seconds_factor:   # ... or use a sequence of transformations
    - expression: "60 * 60 * 24"
    - float
select:
  some_data: path/to/some_data
  more_data: path/to/more_data
transform:
  - add: [!dag_tag some_data, !dag_tag more_data]
  - mul: [!dag_prev , !dag_tag days_to_seconds_factor]
  - print
  - pow: [!dag_prev , !dag_tag exponent]
    tag: my_result

Here, the exponent as well as some conversion factor tags are defined not ad-hoc but separately via the define interface. As can be seen in the example, there are two ways to do this:

  • If providing a list or tuple type, it is interpreted as a sequence of transformations, accepting the same syntax as transform. After the final transformation, another node is added that sets the specified tag, days_to_seconds_factor in this example.

  • If providing any other type, it is interpreted directly as a definition, adding a single transformation node that holds the given argument, the integer 4 in the case of the exponent tag.

Note

The define argument is evaluated before the other two. Consequently, tags defined via define can be used within select or transform, but not the other way around.

Hint

In the context of plotting, the define interface has an important benefit over the select and transform syntax for adding nodes to the DAG: It is dictionary-based, which makes it very easy to recursively update its content; this is very useful for Plot Configuration Inheritance.

Note

When using the DAG for plot data selection, not all arguments can be exposed on the top level of the plot configuration; the define argument is one of the arguments that needs to be nested in the DAG-specific config entry dag_options:

my_plot:
  dag_options:
    define:
      exponent: 4
      # ...

  select:
    # ...
  transform:
    - # ...

Individually adding nodes#

Nodes can be added to the TransformationDAG during initialization; all the examples above are written in that way. However, transformation nodes can also be added after initialization using the following two methods:

  • add_node() adds a single node and returns its reference.

  • add_nodes() adds multiple nodes, allowing the define, select, and transform arguments in the same syntax as during initialization. Internally, this parses the arguments and calls add_node().
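
A brief sketch of how this can look, assuming the tdag object from above (argument names as in the full syntax specification below; details may differ between dantro versions):

# Add a single node, getting back a reference to it
ref = tdag.add_node(operation="add", args=[1, 2], tag="a_sum")

# Add several nodes at once, using the same syntax as at initialization
tdag.add_nodes(transform=[
    dict(operation="mul", args=[3, 4], tag="a_product"),
])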

Minimal Syntax#

To make the definition a bit less verbose, there is a so-called minimal syntax, which is internally translated into the explicit and verbose one documented above. This can make DAG specification much easier:

select:
  some_data: path/to/some_data
  more_data: path/to/more_data
transform:
  - add: [!dag_tag some_data, !dag_tag more_data]
  - increment
  - print
  - pow: [!dag_prev , 4]
    tag: my_result

This DAG will have three custom tags defined: some_data, more_data and my_result. Computation of the my_result tag is equivalent to:

my_result = ((some_data + more_data) + 1) ** 4

As can be seen above, the minimal syntax gets rid of the operation, args and kwargs keys by allowing a transformation to be specified as <operation name>: <args or kwargs> or even as just a string, <operation name>, without further arguments.

With arguments, <operation name>: <args or kwargs>#

When passing a sequence (e.g. [foo, bar]) the arguments are interpreted as positional arguments; when passing a mapping (e.g. {foo: bar}), they are treated as keyword arguments.

Hint

In this shorthand notation it is still possible to specify the respective “other” types of arguments using the args or kwargs keys. For example:

transform:
  - my_operation: [foo, bar]
    kwargs: { some: more, keyword: arguments }
  - my_other_operation: {foo: bar}
    args: [some, positional, arguments]

Without arguments, <operation name>#

When specifying only the name of the operation as a string (e.g. increment and print), it is assumed that the operation accepts only a single positional argument and no other arguments. That argument is automatically filled with a reference to the result of the previous transformation, i.e.: the result is carried over.

For example, the above transformation with the increment operation would be translated to:

operation: increment
args: [!dag_prev ]
kwargs: {}
tag: ~

Operation Hooks#

The DAG syntax parser allows attaching additional parsing functions to operations, which can help to supply a more concise syntax. These so-called operation hooks are described in more detail here. As an example, the expression operation can be specified much more conveniently with the use of its hook. Taking the example from above, the same can be expressed as:

select:
  some_data: path/to/some_data
  more_data: path/to/more_data
transform:
  - expression: (some_data + more_data + 1) ** 4
    tag: my_result

In this case, the hook automatically extracts the free symbols (some_data and more_data) and translates them to the corresponding DAGTag objects. Effectively, it parses the above to:

select:
  some_data: path/to/some_data
  more_data: path/to/more_data
transform:
  - expression: (some_data + more_data + 1) ** 4
    kwargs:
      symbols:
        some_data: !dag_tag some_data
        more_data: !dag_tag more_data
    tag: my_result

If you care to deactivate a hook, set the ignore_hooks flag for the operation:

operation: some_hooked_operation
args: [foo, bar]
ignore_hooks: true

Warning

Failing operation hooks will emit a logger warning, informing about the error; they do not raise an exception. While this might not lead to a failure during parsing, it might lead to an error during computation, e.g. when you are relying on the hook to have adjusted the operation arguments.

Depending on the operation arguments, there can be cases where the hook will not be able to perform its function because it lacks information that is only available after a computation. In such cases, it’s best to deactivate the hook as described above.

Graph representation and visualization#

The TransformationDAG has the ability to represent the internally used directed acyclic graph as a networkx.classes.digraph.DiGraph. By calling the generate_nx_graph() method, the Transformation objects are added to a graph and the dependencies between these transformations are added as directed edges.

This can help to better understand the generated DAG and is useful not only for debugging but also for optimization, as it allows showing the associated profiling information.
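
For example, the generated graph object can be inspected with regular networkx tools (a sketch, assuming the tdag object from above):

import networkx as nx

g = tdag.generate_nx_graph()
assert nx.is_directed_acyclic_graph(g)

# e.g., iterate over the nodes in a valid order of computation:
for node in nx.topological_sort(g):
    print(node)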

Hint

It can be configured whether the edges should represent the “flow” of results through the DAG (edges pointing towards the node that requires a certain result) or whether they should point towards a node’s dependency.

By default, generate_nx_graph() has edges_as_flow set to True, thus having edges point in the effective direction of computation.

Visualization#

In addition to generating the graph object, the visualize() method can generate a visual output of the DAG:

DAG visualization

In this example, the my_result node is at the bottom and the arrows point towards it from the transformations it depends on. Effectively, calculation starts at the top, with data being read from the dm node (the associated DataManager), then follows the arrows towards the my_result node, applying the specified operations like squared, increment and so on.

The circles in the background show the status of the computation, green meaning that a node’s result was computed as expected; other colors and their corresponding status are detailed in the legend. The node status can indicate where in a DAG computation routine an error occurred. To adjust this, have a look at the show_node_status argument and the annotation_kwargs, which control the legend.

Note

Operation arguments cannot easily be shown as it would quickly become too cluttered. For that reason, the visualization typically restricts itself to showing the operation name, the result (if computed), and the tag (if set).

See visualize() for more info.

Hint

If using the data transformation framework for plot data selection, visualization is deeply integrated there; see DAG Visualization.

Hint

DAG visualization works much better with pygraphviz installed, because it gives access to more capable layouting algorithms.

Export#

To post-process the DAG data elsewhere, use the standalone export_graph() function.

Full syntax specification of a single transformation node#

To illustrate the possible arguments for creating a transformation node via add_node(), the following block contains a full specification of available keys and arguments. It is a combination of arguments to Transformation and arguments that are handled by TransformationDAG, which is aware of the whole DAG.

Note that this is the explicit representation, which is a bit verbose. Except for operation, args, kwargs and tag, all entries are set to default values.

operation: some_operation       # The name of the operation
args:                           # Positional arguments
  - !dag_tag some_result        # Reference to another result
  - second_arg
kwargs:                         # Keyword arguments
  one_kwarg: 123
  another_kwarg: foobar
salt: ~                         # Is included in the hash; set a value here
                                # if you would like to provoke a cache miss
fallback: ~                     # May only be given if ``allow_failure`` is
                                # also set, in which case it specifies a
                                # fallback value (or reference) to use
                                # instead of the operation result.

# All arguments _below_ are NOT taken into account when computing the hash
# of this transformation. Two transformations that differ _only_ in the
# arguments given below are considered equal to each other.

tag: my_result                  # The tag of this transformation. Optional.
force_compute: ~                # Used to force computation of this node
                                # without needing to assign a tag.

allow_failure: ~                # Whether to allow this transformation to
                                # fail during computation or resolution of
                                # the arguments (i.e.: upstream error).
                                # Special options are: log, warn, silent

memory_cache: true              # If false, will not keep the computed
                                # result in memory but either re-compute it
                                # or load it from the file cache.

file_cache:                     # File cache options
  read:                         # Read-related options
    enabled: false              # Whether to read from the file cache
    always: false               # If true, will always read from the file
                                # cache, regardless of whether the result
                                # was already stored in the memory cache
                                # or just computed.
    load_options:
      unpack: ~                 # Whether to unpack the result from the
                                # dantro container it was loaded into.
                                # If None, will do so only for numeric
                                # types (xarray and numpy arrays)
      # ... further arguments are passed on to DataManager.load
  write:                        # Write-related options
    enabled: false              # Whether to write to the file cache

    # If writing is enabled, the following options determine whether a
    # cache file should actually be written (does not always make sense)
    always: false               # If true, skips other conditions below and
                                # ensures that a cache file is created.
                                # NOTE: This will not *overwrite* an
                                # existing cache file by default; see the
                                # ``allow_overwrite`` parameter for that.
    allow_overwrite: false      # If false, will not write if a cache file
                                # already exists (even with ``always`` set)
    min_size: ~                 # If given, the result needs to have at
                                # least this size (in bytes) for it to be
                                # written to a cache file.
    max_size: ~                 # Like min_size, but upper boundary
    min_compute_time: ~         # If given, a cache file is only written
                                # if the computation time of this node on
                                # its own, i.e. without the computation
                                # time of the dependencies exceeded this
                                # value.
    min_cumulative_compute_time: ~  # Like min_compute_time, but actually
                                # taking into account the time it took to
                                # compute results of the dependencies.

    # Options used when storing a result in the cache
    storage_options:
      raise_on_error: false     # Whether to raise if saving failed
      attempt_pickling: true    # Whether to attempt pickling if saving
                                # via a specific save function failed
      pkl_kwargs: {}            # Passed on to pkl.dumps
      ignore_groups: true       # Whether to attempt storing dantro groups
      # ... additional arguments passed on to the specific saving function

Note

This does not reflect any arguments made available by the DAG parser! Features like the minimal syntax or the operation hooks are handled prior to the initialization of a Transformation object.

Hint

Often the easiest way to learn is by example. Make sure to check out the Data Transformation Examples page, where you will find practical examples that go beyond what is shown here.

Meta-Operations#

In essence, the transformation framework, as described above, can be used to define a sequence of data operations, just like sequential program code would. Now, what if parts of these operations are used multiple times? In a typical computer program, one would define a function to modularize part of the program. The equivalent construct in the data transformation framework is a so-called meta-operation, which can be characterized in the following way:

  • It can have input arguments that define which objects it should work on

  • It consists of a number of operations that transform the arguments in the desired way

  • It has one (and only one) output, the return value

How are meta-operations defined? Meta-operations can be defined in just the same way as regular transformations are defined, with some additional syntax for defining positional arguments (args) and keyword arguments (kwargs). Let’s look at an example:

# Define meta-operations
meta_operations:
  # Compute (x**2 + 1) for a positional input argument x
  square_plus_one:
    - pow: [!arg 0, 2]
    - add: [!dag_prev , 1]   # <-- this operation's result is the "return value"

  # Select some data and directly compute its mean
  select_and_compute_mean:
    select:
      data:
        path: !kwarg to_select
    transform:
      - .mean: !dag_tag data

This defines two meta-operations: square_plus_one (with one positional argument) and select_and_compute_mean (with the to_select keyword argument). During initialization of TransformationDAG, these can be passed using the meta_operations argument.

How are meta-operations used? In exactly the same way as all regular data operations: simply define their name as the operation argument of a transformation.

# Use the meta-operations within the regular data transformation, alongside
# the already-existing data operations
transform:
  - select_and_compute_mean:
      to_select: path/to/some_data
    tag: some_data_mean
  - select_and_compute_mean:
      to_select: path/to/more_data
    tag: more_data_mean
  - add: [!dag_tag some_data_mean, !dag_tag more_data_mean]
  - square_plus_one: !dag_prev
    tag: result

Here, one of the meta-operations is used to compute two mean values from a selection; these are then added together via the regular add operation; finally, the other meta-operation is applied to that sum, yielding the result. While the individual meta-operations are not complex in themselves, this illustrates how repeatedly invoked transformations can be modularized.
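
In Python terms, the above corresponds roughly to the following, assuming that the selected objects support a .mean() method:

def select_and_compute_mean(*, to_select):
    return dm[to_select].mean()

def square_plus_one(x, /):
    return x ** 2 + 1

some_data_mean = select_and_compute_mean(to_select="path/to/some_data")
more_data_mean = select_and_compute_mean(to_select="path/to/more_data")
result = square_plus_one(some_data_mean + more_data_mean)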

Note

The examples in these sections use the meta_operations top-level entry to illustrate the definition of meta-operations. The transform and/or select top-level entries are used to denote how meta-operations can be invoked (in the same way as regular operations).

As a brief summary:

  • Meta-operations are defined via the meta_operations argument of TransformationDAG, using the same syntax as for other transformations.

  • They can specify positional and keyword arguments and have a return value.

  • They can be used for the operation argument of any transformation, same as other available data operations.

  • Meta-operations allow modularization and thereby simplify the definition of data transformations.

Hint

To use meta-operations for plot data selection, define them under the dag_options.meta_operations key of a plot configuration.

Defining meta-operations#

The example above already gave a glimpse into how to define meta-operations. In many ways, this works exactly the same as defining transformations, e.g. under the transform argument.

Specifying arguments#

Like Python functions, meta-operations can have two kinds of arguments:

  • Positional arguments, defined using the !arg <position> YAML tag

  • Keyword arguments, defined using the !kwarg <name> YAML tag

These can be used anywhere inside the meta-operation specification and serve as placeholders for expected arguments. Let’s look at an example:

meta_operations:
  my_equation:  # [(a + b) * c - d] / e
    - add: [!kwarg a, !kwarg b]
    - mul: [!dag_prev , !kwarg c]
    - sub: [!dag_prev , !kwarg d]
    - div: [!dag_prev , !kwarg e]

transform:
  - my_equation:
      a: 1
      b: 10
      c: 8
      d: 4
      e: 2
    tag: the_answer

When the meta-operation gets translated into nodes, the corresponding positional and keyword arguments are replaced with the values from the args and kwargs of the transformation specification.
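
The meta-operation above behaves like the following Python function:

def my_equation(*, a, b, c, d, e):
    return ((a + b) * c - d) / e

the_answer = my_equation(a=1, b=10, c=8, d=4, e=2)   # == 42.0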

Some remarks:

  • Positional and keyword arguments can be mixed

  • Arguments can be referred to multiple times within a meta-operation definition

  • The set of positional arguments, if specified, needs to include all integers between zero and the highest defined !arg <position>.

  • Optional arguments and variable positional or keyword arguments are not supported (yet).

Return values#

Meta-operations always have one and only one return value: the last defined transformation.

Hint

To have “multiple” return values, e.g. to return an intermediate result, aggregate objects into a dict that can then be unpacked outside of the meta-operation. For an example, see Aggregate return values.

Using select within meta-operations#

The definitions inside meta_operations can have two possible formats:

meta_operations:
  # A -- as list  ==> transformations only
  a_plus_b_cubed:
    - add: [!arg 0, !arg 1]
    - pow: [!dag_prev , 3]

  # B -- as dict  ==>  selections _and_ transformations
  select_and_square:
    select:
      data:
        path: !kwarg to_select
    transform:
      - squared: !dag_tag data

transform:
  - a_plus_b_cubed: [1, 2]
    tag: result1
  - select_and_square:
      to_select: path/to/some_data
    tag: result2

As can be seen above, the dict-based definition supports using the select interface. Importantly, this supports parametrization: simply use !arg or !kwarg inside the select specification, e.g. to make the path of the to-be-selected object an argument to the meta-operation.

Internal tags#

When defining simple meta-operations, passing the output of the previous operation through to the next one using !dag_prev usually suffices to connect operations. Such meta-operations are essentially linear DAGs.

However, to define non-linear meta-operations (or: general DAGs), it needs to be possible to use the result of any previously specified transformation. For that purpose, the tag entry and the !dag_tag YAML tag can be used, same as in the usual specification of references between transformations:

meta_operations:
  my_meta_operation:  # [(x+1) * (x-1)] / (2*y)
    - add: [!arg 0, +1]
      tag: left
    - add: [!arg 0, -1]
      tag: right
    - mul: [!dag_tag left, !dag_tag right]
      tag: top
    - mul: [!arg 1, 2]
    - div: [!dag_tag top , !dag_prev ]

transform:
  - my_meta_operation: [9, 2]  # [(9+1) * (9-1)] / (2*2) == 20
    tag: result

Internal tags are all tags defined inside the meta-operation definition. These tags are solely accessible within the meta-operation and will not be available as results later on (only the return value will). In the above example, the left, right, and top tags are internal tags and they are referenced using the already-known !dag_tag YAML tag.

This is in contrast to the result tag in the example, which is a regular tag. Effectively, the regular tag is attached to the last transformation of the meta-operation, here the div operation.
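
As a Python function, this meta-operation would look like this:

def my_meta_operation(x, y, /):
    left = x + 1
    right = x - 1
    top = left * right
    return top / (2 * y)

result = my_meta_operation(9, 2)   # == 20.0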

Note

In order to avoid silent errors and reduce unexpected behavior, all internally defined tags need to be used within the meta-operation.

Argument default values#

Meta-operation arguments (!arg and !kwarg) can also have default values. These are defined by passing a list of length 2 to the YAML tag (instead of a scalar position for positional arguments or a name for keyword arguments).

For instance, if you want an optional keyword argument foo, define it as:

!kwarg [foo, my_default_value]

Equivalently for positional arguments:

!arg [0, my_default_value]

Let’s look at an example where the my_increment meta-operation would increment by one per default or by some other value, if desired:

meta_operations:
  my_increment:
    - add: [!arg 0, !arg [1, 1]]

transform:
  - my_increment: [0]
    tag: one
  - my_increment: !dag_prev
    tag: two
  - my_increment: [!dag_prev , 8]
    tag: ten

The above meta-operation is equivalent to the following Python function with one required positional-only argument and one optional positional-only argument:

def my_increment(x, delta = 1, /):
    return x + delta

one = my_increment(0)
two = my_increment(one)
ten = my_increment(two, 8)

For a larger example that is using keyword arguments, see below.

Hint

Default values need not be scalar; they can be any object, as long as it does not contain Placeholder objects like tags, references, or other argument definitions.

Warning

To clearly distinguish which arguments are optional and which are required, make sure that any !arg or !kwarg with a default value specifies that default in all occurrences of the argument within your meta-operation:

There should never be both !arg [0, 42] and !arg 0 in the same meta-operation.

Examples#

prime_multiples#

The following example performs operations on the arguments and then uses internal tags (!dag_tag) to connect their output to a result.

meta_operations:
  prime_multiples:
    # Define powers of primes 2, 3, 5, and 7
    - pow: [2, !kwarg [base2, 0]]
      tag: b2
    - pow: [3, !kwarg [base3, 0]]
      tag: b3
    - pow: [5, !kwarg [base5, 0]]
      tag: b5
    - pow: [7, !kwarg [base7, 0]]
      tag: b7
    # Compute their product
    - np.: [prod, [!dag_tag b2, !dag_tag b3, !dag_tag b5, !dag_tag b7]]

transform:
  - prime_multiples:
      base2: 2
      base3: 1
      # base5: 0
      base7: 3
    tag: result
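
As a Python function, prime_multiples corresponds to:

import numpy as np

def prime_multiples(*, base2=0, base3=0, base5=0, base7=0):
    return np.prod([2 ** base2, 3 ** base3, 5 ** base5, 7 ** base7])

result = prime_multiples(base2=2, base3=1, base7=3)   # == 4116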

As can be seen in the following plot, the meta-operation is unpacked into individual transformation nodes:

DAG visualization

Hint

The DAG visualization also shows which operation originated from which meta-operation (in parentheses below the operation name). Here, all originate from prime_multiples.

Aggregate return values#

In this example, a dict operation is used to return multiple results from a meta-operation.

meta_operations:
  # Given some input data (as positional argument), compute a bunch of
  # statistical quantities. To return them, aggregate them into a dict.
  compute_stats:
    # Make sure it's an xarray object
    - xr.DataArray: !arg 0
      tag: data

    # Compute the statistics
    - .mean: !dag_tag data
      tag: mean
    - .std: !dag_tag data
      tag: std
    - .median: !dag_tag data
      tag: median
    - .min: !dag_tag data
      tag: min
    - .max: !dag_tag data
      tag: max
    - .quantile: [!dag_tag data, 0.25]
      tag: q25
    - .quantile: [!dag_tag data, 0.75]
      tag: q75

    # Aggregate into a dict as return value
    - dict:
        mean: !dag_tag mean
        std: !dag_tag std
        median: !dag_tag median
        min: !dag_tag min
        max: !dag_tag max
        q25: !dag_tag q25
        q75: !dag_tag q75

# Usage example: select some data and get some of the desired statistics
select:
  some_data: path/to/some_data
transform:
  - compute_stats: !dag_tag some_data
    tag: some_stats

  - getitem: [!dag_tag some_stats, mean]
    tag: some_mean
  - getitem: [!dag_tag some_stats, std]
    tag: some_std
  - getitem: [!dag_tag some_stats, median]
    tag: some_median

Note that by aggregating results into an object, the DAG will not be able to discern whether a branch of the compute_stats meta-operation is actually needed, thus potentially computing more results than required. In order to avoid computing more nodes than necessary, aggregated return values should be used sparingly; ideally, use them only to return an intermediate result.

This packing and unpacking can also be observed in the DAG plot:

DAG visualization

my_gauss#

This example shows how to define a mathematical expression (also see: operation hooks) and expose its symbols as arguments of the meta-operation:

meta_operations:
  # A meta-operation that defines a gaussian
  my_gauss:
    - expression: a * exp(- (x - mu)**2 / (2 * sigma**2))
      kwargs:
        symbols:
          x: !kwarg x
          a: !kwarg a
          mu: !kwarg mu
          sigma: !kwarg sigma

transform:
  # Compute the Gaussian for two values
  - my_gauss:
      a: 1.
      mu: 0.
      sigma: 1.
      x: 0.
    tag: default_gaussian
  - my_gauss:
      a: 1.
      mu: 23.
      sigma: 10.
      x: 23.
    tag: wide_gaussian_moved

For this case, it makes a lot of sense to use default values for meta-operation arguments, thus reducing the number of keyword arguments that need to be specified:

meta_operations:
  # A meta-operation that defines a gaussian
  my_gauss:
    - expression: a * exp(- (x - mu)**2 / (2 * sigma**2))
      kwargs:
        symbols:
          x: !kwarg x
          a: !kwarg [a, 1.]
          mu: !kwarg [mu, 0.]
          sigma: !kwarg [sigma, 1.]

transform:
  # Compute the Gaussian for two values
  - my_gauss:
      x: 0.
    tag: default_gaussian
  - my_gauss:
      x: 23.
      a: 1.
      mu: 23.
      sigma: 10.
    tag: wide_gaussian_moved

Hint

If you do not want to define default arguments, e.g. because you want to control the shared defaults via some YAML-based logic, you can also reduce the number of repeated arguments using YAML anchors and inheritance:

transform:
  - my_gauss: &my_gauss_defaults    # <-- defines the defaults
      a: 1.
      mu: 0.
      sigma: 1.
      x: 0.
    tag: default_gaussian
  - my_gauss:
      <<: *my_gauss_defaults        # <-- re-use defaults ...
      a: 10.                        #     ... and update with new values
    tag: scaled_gaussian
  - my_gauss:
      <<: *my_gauss_defaults
      mu: -42.
    tag: moved_gaussian

Remarks & Caveats#

Note the following remarks regarding the definition and use of meta-operations:

  • Inside meta-operations, no outside tags except the “special” tags (dag, dm, select_base) can be used. Further inputs should be handled by adding arguments to the meta-operation as described above.

  • When using the select syntax in the definition of a meta-operation and aiming to define an argument, note that the long syntax needs to be used:

    select:
      # Correct
      some_data:
        path: !kwarg some_data_path
    
      # WRONG! Will not work.
      other_data: !kwarg other_data_path
    
  • When defining a meta-operation and using an operation that makes use of an operation hook, the tags created by the hook need to be explicitly exposed as arguments, otherwise there will be an Unused tags ... error. To expose them, there are two ways:

    • Use them internally by adding a define, dict, or list operation prior to the operation that uses the hook; then explicitly specify them as arguments there.

    • In the case of the expression operation hook, use the kwargs.symbols entry to directly define them as arguments, as done in the my_gauss example above.

  • A meta-operation always adds a so-called “result node”, which uses the pass operation to make the result of the meta-operation available. When using a meta-operation, the arguments tag and file_cache (see below) as well as any error handling arguments are added only to this result node. For all other transformation nodes of a meta-operation, the following holds:

    • They may have only internal tags attached

    • They may define their own file_cache behavior; if they do not, the default values for file caching are used.

    • They are free to define their own error handling behavior.

Error Handling#

Operations are not always guaranteed to succeed. To define more robust operations, some form of error handling is required, akin to try-except blocks in Python.

In the data transformation framework, the allow_failure option handles failing data operations and allows specifying a fallback value to use as the result in case the operation fails. Let’s have a look:

  - float: "inf"
  - div: [1, 0]               # 1 / 0  -->  raises ZeroDivisionError
    allow_failure: true
    fallback: !dag_prev
    tag: result

Here, the ZeroDivisionError is avoided and, instead, the value of the previous node (which defines a float infinity value) is used. Consequently, the result will be the Python floating-point inf.
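
In plain Python, this corresponds to the following try-except block:

fallback = float("inf")
try:
    result = 1 / 0
except ZeroDivisionError:
    result = fallback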

Note

The allow_failure argument also accepts a few string-like values which control the verbosity of the output in case of failure:

  • log does the same as True: print a prominent log message that informs about the failed operation and the use of the fallback.

  • warn: emits a Python warning

  • silent: suppresses the message altogether

Example:

  - float: "inf"
  - div: [1, 0]
    allow_failure: silent     # can also be: True, log, warn, False
    fallback: !dag_prev

For debugging, make sure to not use silent.

Hint

The fallback argument accepts not only scalars, but also sequences or mappings, which in turn may contain !dag_tag references.

Upstream errors#

Sometimes, an error only becomes relevant in a later operation and it makes sense to defer error handling to that point. The analogy to Python exception handling would be to handle the error not directly where it occurs but in an outside scope.

This is also possible within the error handling framework, because allow_failure pertains both to the computation of the specified operation and to the resolution of its arguments. As the resolution of arguments triggers the computation of dependent nodes (and their dependencies, and so forth), an upstream error may also be caught in a downstream node:

  # Example input: assume that this may also be the output from previous
  # operations which are used to calculate something else ...
  - define: -1.23
    tag: some_value
  - define: +1
    tag: some_other_value

  # Perform some potentially problematic operations with these ...
  - import_and_call: [math, log10, !dag_tag some_value]   # --> ValueError
    tag: log10_value

  - import: [numpy, pi]
    tag: pi
  - sub: [!dag_tag some_other_value, 1.]
  - div: [!dag_tag pi, !dag_prev ]                        # --> ZeroDivisionError
    tag: pi_over_some_other_value

  # ... leading to the result
  - add: [!dag_tag log10_value, !dag_tag pi_over_some_other_value]
    allow_failure: true
    fallback: 42
    tag: my_result

In this example, the nodes tagged log10_value and pi_over_some_other_value are both problematic but do not specify any error handling. However, we may only be interested in my_result, which depends on those two transformation results. Let’s say, we specified compute_only: [my_result]. What would happen in such a case?

  • The transformation tagged my_result is looked up in the DAG.

  • The transformation’s arguments are recursively resolved, triggering lookup of the dependencies log10_value and pi_over_some_other_value.

  • The referenced transformations would in turn look up their arguments and finally lead to the application of the problematic operations (div and math.log10), which will fail for the arguments in this example.

  • An error is raised during those operations.

  • The error propagates back to the my_result transformation.

  • With allow_failure: true, the error is caught and the fallback value is used instead.
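
In Python terms, this deferred handling is roughly analogous to wrapping the whole computation into a single try block instead of handling each error where it occurs:

import math

some_value = -1.23
some_other_value = 1

try:
    log10_value = math.log10(some_value)   # raises ValueError here
    pi_over_some_other_value = math.pi / (some_other_value - 1)
    my_result = log10_value + pi_over_some_other_value
except (ValueError, ZeroDivisionError):
    my_result = 42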

Warning

The above example only works with compute_only: [my_result]. If the problematic tags were to be computed directly, e.g. via compute_only: all, they would raise an error because they do not specify any error handling themselves.

Note

This example is purely for illustration! Typically, one would define these operations using numpy and they would not raise exceptions but issue a RuntimeWarning and use nan as result.

Error handling within select#

The select operation may also specify a fallback. This fallback will only be applied to the getitem operation which is used to look up the path from the specified selection base:

select:
  some_data: path/to/some_data
  mean_data:
    path: some/invalid/path       # The underlying `getitem` will fail ...
    allow_failure: true           # ... but is allowed to.
    fallback: [[1, 2, 3]]         # Instead, this fallback value is used.
    transform:
      - np.mean                   # ... which still works for a mean

transform:
  - expression: (some_data + mean_data + 1) ** 4
    tag: my_result

Hint

The transform elements can of course again specify their own fallbacks.

Limitations#

There are some limitations to using allow_failure within select. Mainly, specifying a fallback may be difficult in practice because other tags may not be available yet at the time the DAG is populated with the select arguments.

The tags specified by select are added in alphabetical order and before any transformations from transform are added to the DAG. Consequently, lookups within one select field are only possible from within select and only for fields that appear earlier in that alphabetical order. (See this issue for a potential improvement to this behavior.)

Using a tagged reference in the fallback works in the following example because '_some_fallback_data' < 'mean_data':

select:
  _some_fallback_data: path/to/some_data
  mean_data:
    path: some/invalid/path
    allow_failure: true
    fallback: !dag_tag _some_fallback_data
    transform:
      - np.mean

transform:
  - expression: (mean_data + 1) ** (-0.5)
    tag: my_result

Hint

We advise against building overly complex fallback structures within select, e.g. tagged fallbacks which in turn have tagged fallbacks and so forth. While possible, this easily becomes tedious to build and maintain.

If you require more advanced error handling for certain operations, consider wrapping them into your own data operation. See Resolving and applying operations for more information.

The File Cache#

Caching of already computed results is a powerful feature of the TransformationDAG class. The idea is that if a specific, computationally expensive transformation has already been carried out previously, it should not be necessary to compute it again.

Background#

To understand the file cache, it’s first necessary to understand the internal handling of transformations.

Within the DAG, each transformation is fully identified by its hash. If two transformations have the same hash, they have the same operation and the same arguments.

All Transformation objects are stored in an objects database, which maps a hash to a Python object. In effect, there is one and only one Transformation object associated with a certain hash.

Say, a DAG contains two nodes, N1 and N2, with the same hash. Then the object database contains a single transformation T, which is used in place of both nodes N1 and N2. Thus, if the result of one of the nodes is computed, the other should already know the result and not need to re-compute it.

That is what is called the memory cache: once a result is computed, it stays in memory, such that it need not be recomputed again. This is useful not only in the above situation but also when doing DAG traversal during computation.

The file cache is not much different from the memory cache: it aims to make computation results persist in order to save computational resources. With the file cache, results can persist over multiple invocations of the transformation framework.

Configuration#

Cache directory#

Cache files need to be written in some place. This can be specified via the cache_dir argument during the initialization of a TransformationDAG; see there for details.

By default, the cache directory is called .cache and is located inside the data directory associated with the DAG’s DataManager. It is created once it is needed.

Default file cache arguments#

File cache behavior can be configured separately for each Transformation, as can be seen from the full syntax specification above.

However, it’s often useful to have default values defined that all transformations share. To do so, pass a dict to the file_cache_defaults argument. In the simplest form, it looks like this:

file_cache_defaults:
  read: true
  write: true
transform:
  - # ...

This enables both reading from the cache directory and writing to it. When passing booleans for read and write, the default behavior is used. To configure the behavior in more detail, again see the full syntax specification above.

When specifying additional file_cache arguments within transform, the values specified there recursively update the ones given by file_cache_defaults.
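
In Python, enabling these defaults would look like this (a sketch, assuming the dm object from above):

tdag = TransformationDAG(
    dm=dm,
    file_cache_defaults=dict(read=True, write=True),
    transform=[dict(operation="add", args=[1, 2], tag="result")],
)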

Note

The getitem operations defined via the select interface always have caching disabled; it makes no sense to cache objects that have been looked up directly from the data tree.

Warning

The file cache arguments are not taken into account for computation of the transformations’ hash. Thus, if there are two transformations with the same hash, only the additional file cache arguments given to the first one are taken into account; the second ones have no effect because the second transformation object is discarded altogether.

Warning

If it is desired to have two transformations with different file cache options, the salt can be used to perturb its hash and thus force the use of the additional file cache arguments.

Reading from the file cache#

Generally, the best computation is the one you don’t need to make. If there is no result in memory and reading from cache is enabled, the cache directory is searched for a file that has as its basename the hash of the transformation that is to be computed.

If such a file exists, the DataManager is used to load the data into the data tree, and the memory cache is set. (Note that this is Python: the memory cache is not a copy but a reference to the object in the data tree.)

By default, it is not attempted to read from the cache directory. See above on how to enable it.

Note

When desiring to use the caching feature of the transformation framework, the employed DataManager needs to be able to load numerical data. If you are not already using the AllAvailableLoadersMixin, consider adding NumpyLoaderMixin, XarrayLoaderMixin, and PickleLoaderMixin to your DataManager specialization.

Hint

Sometimes it can be desired to always read from the file cache, e.g. to make use of the load_options argument. In that case, set the following arguments to make sure that a cache file will be written after a computation.

file_cache:
  read:
    enabled: true
    always: true
    load_options:
      chunks: true
  write: true

Note that the computed result may still remain in the memory cache. See Transformation on how to not keep it in memory.

Writing to the file cache#

After a computation result was either looked up from the cache or computed, it can be stored in the file cache. By default, writing to the cache is not enabled, either. See above on how to enable it.

When writing is enabled, several conditions determine whether a transformation’s result is actually written to a file. For example, it might make sense to store only results that took a very long time to compute or that are very large.

Once it is decided that a result is to be written to a cache file, the corresponding storage function is invoked. It creates the cache directory, if it does not already exist, and then attempts to save the result object using a set of different storage functions.

There are specific storage functions for numerical data: numpy arrays are stored via the numpy.save function, which is also used to store NumpyDataContainer objects. Another specific storage function takes care of xarray.DataArray and XrDataContainer objects.

If there is no specific storage function available, it is attempted to pickle the object.

Note

It is not currently possible to store BaseDataGroup-derived objects in the file cache.

Remarks#

  • The structure of the DAG – a Merkle tree, or: hash tree – ensures that each node’s hash depends on all parent nodes’ hashes. Thus, all downstream hashes will change if some early operation’s arguments are changed.

  • The transformation framework cannot distinguish between arguments that are relevant for the result and those that are not; all arguments are taken into account when computing the hash.

  • It might not always make sense to read from or write to the cache, depending on how long it took to compute, how much data is to be stored and loaded and how long that takes.

  • Dividing up large transformations into many small transformations increases the chance of cache hits; however, it also increases the memory footprint of the DAG (more intermediate objects are kept) and the number of read/write operations on the file cache.

  • There may never be more than one file in the cache directory that has the same basename (i.e.: hash) as another file. Such situations need to be resolved manually by deleting all but one of the corresponding files.

  • There is no harm in just deleting the cache directory, e.g. when it gets too large.