Data Transformation Framework#
The uniform structure of the dantro data tree is the ideal starting point for applying transformations to data in a general way.
This page describes dantro’s data transformation framework, revolving around the TransformationDAG
class.
It is sometimes also referred to as DAG framework or data selection and transformation framework and finds application in the plotting framework.
This page is an introduction to the DAG framework and a description of its inner workings. To learn more about its practical usage, make sure to look at the Data Transformation Examples.
Overview#
The purpose of the transformation framework is to be able to generally apply mathematical operations on data that is stored in a dantro data tree. Specifically, it makes it possible to define transformations without touching actual Python code. To that end, a meta language is defined that makes it possible to define almost arbitrary transformations.
In dantro terminology, a transformation is defined as a set consisting of an operation and some arguments.
Say, for example, we want to perform a simple addition of two quantities, 1 and 2; we are used to writing 1 + 2.
To define a transformation using the meta language, this translates to a set consisting of the add operation and two (ordered) arguments: 1 and 2.
Now, typically transformations don’t come on their own and are not nearly as trivial as the above.
You might desire to compute a + b
, where both a
and b
are results of previous transformations.
This can be represented as a directed acyclic graph, or short: DAG. For the example above, the graph is rather small:
a:(…) b:(…)
^ ^
\ /
\ /
(add, a, b)
The nodes in this graph represent transformations.
These nodes can have labels, e.g. a
and b
, which are called references or tags in dantro terminology.
As illustrated by the example above, the tags can be used in place of arguments to denote that the result of a previous transformation (with the corresponding label) should be used.
The directed edges in the graph represent dependencies. The acyclic in DAG is required such that the computation of a transformation result does not end in an infinite loop due to a circular dependency.
The Transformation
and TransformationDAG
dantro classes implement exactly this structure, making the following features available:
- Easy and generic access to data stored in an associated DataManager
- Definition of arbitrary DAGs via dictionary-based configurations
- Syntax optimized to make specification via YAML easy
- Shorthand notations available
- New and custom operations can be registered
- There are no restrictions on the signature of operations
- Caching of transformations is possible, avoiding re-calculation of computationally expensive transformations
- Transformations are uniquely representable by a hash
The Transformation Syntax#
This section will guide you through the syntax used to define transformations. It will explain the basic elements and inner workings of the mini-language created for the purpose of the DAG.
Note
This explanation goes into quite some detail, as it is important to understand the underlying structures of the transformation framework. If you would like to jump ahead to see what awaits you, have a look at the Minimal Syntax.
The TransformationDAG#
The structure a user (you!) mainly interacts with is the TransformationDAG class.
It takes care of building the DAG by creating Transformation objects according to the specification you provide.
In the following, all YAML examples will represent the arguments that are passed to the TransformationDAG
during initialization.
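
For orientation, the following Python sketch shows how such a set of arguments might be constructed and passed to the class. The import locations and the DataManager setup are assumptions here; consult the API documentation of your dantro version for the exact signatures.

import yaml

from dantro.dag import TransformationDAG      # assumed import location
from dantro.data_mngr import DataManager      # assumed import location

# A DataManager associated with some data directory (illustrative setup)
dm = DataManager("~/my_data", name="example")

# The YAML examples on this page correspond to the keyword arguments used here.
# (Custom YAML tags like !dag_tag require dantro's own YAML machinery and are
#  therefore avoided in this plain yaml.safe_load call.)
spec = yaml.safe_load("""
transform:
  - operation: add
    args: [1, 2]
    tag: my_sum
""")

dag = TransformationDAG(dm=dm, **spec)
print(dag.compute())    # -> {'my_sum': 3}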
Basics#
Ok, let’s start with the basics: How can transformations be defined? For the sake of simplicity, let’s only look at transformations that are fully independent of other transformations.
Explicit syntax#
The explicit syntax to define a single Transformation
via the TransformationDAG
looks like this:
transform:
- operation: add
args: [1, 2]
kwargs: {}
The transform
argument is the main argument to specify transformations.
It accepts a sequence of mappings.
Each entry of the sequence contains all arguments that are needed to create a single Transformation
.
As you see, the syntax is very close to the above definition of what a dantro transformation contains.
Note
The args
and kwargs
arguments can also be left out, if no positional or keyword arguments are to be passed, respectively.
This is equivalent to setting them to ~
or empty lists / dicts.
Specifying multiple transformations#
To specify multiple transformations, simply add more entries to the transform
sequence:
transform:
- operation: add
args: [3, 4]
- operation: sub
args: [8, 2]
- operation: mul
args: [6, 7]
Advanced Referencing#
In the examples above, all transformations were independent of each other. Having completely independent and disconnected nodes, of course, defeats the purpose of having a DAG structure.
Now let’s look at proper, non-trivial DAGs, where individual transformations use the results of other transformations.
Referencing other Transformations#
Other transformations can be referenced in three ways, each with a corresponding Python class and an associated YAML tag:
- DAGReference and !dag_ref: This is the most basic and most explicit reference, using the transformations' hash to identify a reference.
- DAGTag and !dag_tag: References by tag are the preferred references. They use the plain text name specified via the tag key.
- DAGNode and !dag_node: Uses the ID of the node within the DAG. Mostly for internal usage!
Note
When the DAG is built, all references are brought into the most explicit format: DAGReferences.
Thus, internally, the transformation framework works only with hash references.
The best way to refer to other transformations is by tag: there is no ambiguity, it is easy to define, and it allows you to easily build a DAG tree structure. A simple example with three nodes would be the following:
transform:
- operation: add
args: [3, 4]
tag: some_addition
- operation: sub
args: [8, 2]
tag: some_subtraction
- operation: mul
args:
- !dag_tag some_addition
- !dag_tag some_subtraction
tag: the_answer
Which is equivalent to:
some_addition = 3 + 4
some_subtraction = 8 - 2
the_answer = some_addition * some_subtraction
References can appear within the positional and the keyword arguments of a transformation. As you see, they behave quite a bit like variables behave in programming languages; the only difference being: you can’t reassign a tag and you should not form circular dependencies.
Using the result of the previous transformation#
When chaining multiple transformations to each other and not being interested in the intermediate results, it is tedious to always define tags:
transform:
- operation: mul
args: [1, 2]
tag: f2
- operation: mul
args: [!dag_tag f2, 3]
tag: f3
- operation: mul
args: [!dag_tag f3, 4]
tag: f4
- operation: mul
args: [!dag_tag f4, 5]
tag: f5
Let’s say, we’re only interested in f5
.
The only thing we want is that the result from the previous transformation is carried on to the next one.
The with_previous_result feature can help in this case: it inserts a reference to the previous node as the first positional argument.
Thus, it is no longer necessary to define a tag.
transform:
- operation: mul
args: [1, 2]
- operation: mul
args: [3]
with_previous_result: true
- operation: mul
args: [4]
with_previous_result: true
- operation: mul
args: [5]
with_previous_result: true
tag: f5
Note that the args
, in that case, specify one fewer positional argument.
Warning
Using !dag_node
in your specifications is not recommended.
Use it only if you really know what you’re doing.
In case the result of the previous transformation should not be used in place of the first positional argument but somewhere else, there is the !dag_prev
YAML tag, which creates a node reference to the previous node:
transform:
- operation: define
args: [10]
- operation: sub
args: [0, !dag_prev ]
- operation: div
args: [1, !dag_prev ]
- operation: power
args: [10, !dag_prev ]
tag: my_result
Note
Notice the space behind !dag_prev
.
The YAML parser might complain about a character directly following the tag, like …, !dag_prev]
.
Computing Results#
To compute the results of the DAG, invoke the TransformationDAG
‘s compute()
method.
It can be called without any arguments, in which case the result of all tagged transformations will be computed and returned as a dict. If only the result of a subset of tags should be computed, they can also be specified.
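
In Python, this could look like the following sketch (the compute_only argument is described in the hint further below; the dag object and the tags are assumed to be set up as in the earlier examples):

# Compute and return the results of all tagged transformations ...
all_results = dag.compute()                   # dict mapping tag -> result

# ... or only those of a subset of tags
some_results = dag.compute(compute_only=["the_answer"])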
Computing results works as follows:
1. Each tagged Transformation is visited and its own compute() method is invoked.
2. A cache lookup occurs, attempting to read the result from a memory or file cache.
3. The transformations resolve potential references in their arguments: If a DAGReference is encountered, the corresponding Transformation is resolved and that transformation's compute() method is invoked. This traverses all the way up the DAG until reaching the root nodes which contain only basic data types (that need no computation).
4. Having resolved all references into results, the arguments are assembled, the operation callable is resolved, and invoked by passing the arguments.
5. The result is kept in a memory cache. It can additionally be stored in a file cache to persist to later invocations.
6. The result object is returned.
Note
Only nodes that are tagged can be part of the results. Intermediate results still need to be computed, but they will not be part of the results dict. If you want an intermediate result to be available there, add a tag to it.
This also means: If there are parts of the DAG that are not tagged at all, they will not be reached by any recursive argument lookup.
Hint
Use the compute_only
argument of compute()
to specify which tags are to be computed.
If not given, all tags will be computed, unless they start with a .
or _
(these are so-called “private” tags).
To compute private tags directly, include them in compute_only
.
You can also force computation of a node, even if untagged, by adding force_compute
to the transformation, see below.
Hint
To learn which parts of the computation require the most time, e.g. in order to evaluate whether to cache the result, inspecting the DAG profile statistics can be useful.
The TransformationDAG
‘s verbosity
attribute controls how extensively statistics are written to the log output.
By default (verbosity 1
), only per-node statistics are emitted.
For levels >= 2
, per-operation statistics are shown alongside.
Resolving and applying operations#
Let’s have a brief look into how the operation
argument is actually resolved and how the operation is then applied.
This feature is not specific to the DAG, but the DAG uses the data_ops
module, which implements a database of available operations and the apply_operation()
function to apply an operation.
Basically, this is a thin wrapper around a function lookup and its invocation.
For a full list of available data operations, see here.
Hint
You can also use the import
operation to retrieve a callable (or any other object) via a Python import and then use the call
operation to invoke it.
These two operations are combined in the import_and_call
operation:
transform:
- operation: import_and_call
args: [numpy.random, randint]
kwargs:
low: 0
high: 10
size: [2, 3, 4]
To specifically register additional operations, use the register_operation()
function.
This should only be done for operations that are not easily usable via the import
and call
operations.
Forcing computation of individual nodes#
Sometimes, it’s useful to force the computation of an individual node.
To that end, simply set the force_compute
option for a transformation:
transform:
- operation: div
args: [1, 0]
force_compute: true
In this example, the node will always be computed, obviously leading to a ZeroDivisionError
.
A few remarks:
- Force-computed tags are computed before the tags specified in compute_only.
- A typical use case is debugging, where you want to make sure that an operation really is carried out.
- Unlike tagged nodes, their results are not available in the results dict. Even if the node is tagged, it will only appear in the results dict if it is part of compute_only.
Selecting from the DataManager#
The above examples are trivial in that they do not use any actual data but define some dummy values.
This section shows how data can be selected from the DataManager
that is associated with the TransformationDAG
.
The process of selecting data is no different from other transformations: it makes use of the getitem operation that would also be used for regular item access, and it relies on the fact that the data manager is available via the dm tag.
Note
The DataManager
is also identified by a hash, which is computed from its name and its associated data directory path.
Thus, managers for different data directories have different hashes.
The select interface#
As selecting data from the DataManager
is a common use case, the TransformationDAG
supports the select
argument besides the transform
argument.
The select
argument expects a mapping of tags to either strings (the path within the data tree) or further mappings (where more configurations are possible):
select:
some_data: path/to/some_data
more_data:
path: path/to/more_data
# ... potentially more kwargs
transform: ~
The results dict will then have two tags, some_data
and more_data
, each of which is the selected object from the data tree.
Note
The above example is translated into the following basic transformation specifications:
transform:
- operation: getitem
args: [!dag_tag dm, path/to/more_data]
tag: more_data
- operation: getitem
args: [!dag_tag dm, path/to/some_data]
tag: some_data
Note that the order of operations is sorted alphabetically by the tag specified under the select
key.
Directly transforming selected data#
Often, it is desired to apply some sequential transformations to selected data before working with it.
As part of the select
interface, this is also possible:
select:
square_increment:
path: path/to/some_data
with_previous_result: true
transform:
- operation: squared
- operation: increment
some_sum:
path: path/to/more_data
transform:
- operation: getattr
args: [!dag_prev , data]
- operation: sub
args: [0, !dag_prev ]
- operation: .sum
args: [!dag_prev ]
transform:
- operation: add
args: [!dag_tag square_increment, !dag_tag some_sum ]
tag: my_result
Notice the difference between square_increment
, where the result is carried over, and some_sum
, where the reference has to be specified explicitly.
As visible there, within the select
interface, the with_previous_result
option can also be specified such that it applies to a sequence of transformations that are based on some selection from the data manager.
Note
The parser expands this syntax into a sequence of basic transformations.
It does so before any other transformations from the transform
argument are evaluated. Thus, whichever tags are defined there are not available from within select
!
Changing the selection base#
By default, selection happens from the associated DataManager
, tagged dm
.
This option can be controlled via the select_base
property, which can be set both as argument to __init__
and afterward via the property.
The property expects either a DAGReference
object or a valid tag string.
If set, all following select
arguments are using that reference as the basis, leading to getitem
operations on that object rather than on the data manager.
As the select
arguments are evaluated before any transform operations, only the default tags are available during initialization.
To widen the possibilities, the TransformationDAG allows the base_transform argument during initialization; this is just a sequence of transform specifications, which are applied before the select argument is evaluated, thus allowing you to select some object, tag it, and use that tag for the select_base argument.
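
A sketch of what this can look like at initialization time; the base_transform, select_base, and select argument names are taken from this page, while the DAGTag import location and the chosen paths are assumptions:

from dantro.dag import TransformationDAG      # assumed import location
from dantro._dag_utils import DAGTag          # assumed import location

dag = TransformationDAG(
    dm=dm,
    # First select some subtree from the data manager and tag it ...
    base_transform=[
        dict(operation="getitem", args=[DAGTag("dm"), "path/to/subtree"],
             tag="subtree"),
    ],
    # ... then use that tag as the basis for all select operations:
    select_base="subtree",
    select=dict(some_data="further/path/within/subtree"),
)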
Note
The select_path_prefix
argument offers similar functionality, but merely prepends a path to the argument.
If possible, the select_base
functionality should be preferred over select_path_prefix
as it reduces lookups and cooperates more nicely with the file caching features.
Background Information
Internally, when the select
specification is evaluated, it is set to select against a special tag select_base
; by default, this is the same as the dm
special tag.
Effectively, the select
feature always selects starting from the object the select_base
property points to at the time the nodes are added to the DAG.
In other words, if the select_base
is changed after the nodes were added, this will not have any effect.
For meta-operations this means that the base of selection is not relevant at definition of the meta-operations; the base gets evaluated when the meta-operation is used.
The define interface#
So far, we have seen two ways to add transformation nodes to the DAG: via transform
or via select
.
These are based either on directly adding the nodes, giving full control, or adding transformations based on a selection of data.
The define interface is a combination of these two approaches: like select, it revolves around the final tag that's meant to be attached to the definition, but it does not require a data selection like select does.
Let’s look at an example that combines all these ways of adding transformations:
define:
exponent: 4 # directly define some object
days_to_seconds_factor: # ... or use a sequence of transformations
- expression: "60 * 60 * 24"
- float
select:
some_data: path/to/some_data
more_data: path/to/more_data
transform:
- add: [!dag_tag some_data, !dag_tag more_data]
- mul: [!dag_prev , !dag_tag days_to_seconds_factor]
- print
- pow: [!dag_prev , !dag_tag exponent]
tag: my_result
Here, the exponent
as well as some conversion factor tags are defined not ad-hoc but separately via the define
interface.
As can be seen in the example, there are two ways to do this:
- If providing a list or tuple type, it is interpreted as a sequence of transformations, accepting the same syntax as transform. After the final transformation, another node is added that sets the specified tag, days_to_seconds_factor in this example.
- If providing any other type, it is interpreted directly as a definition, adding a single transformation node that holds the given argument, the integer 4 in the case of the exponent tag.
Note
The define
argument is evaluated before the other two.
Consequently, tags defined via define can be used within select or transform, but not the other way around.
Hint
In the context of plotting, the define
interface has an important benefit over the select
and transform
syntax for adding nodes to the DAG:
It is dictionary-based, which makes it very easy to recursively update its content; this is very useful for Plot Configuration Inheritance.
⚠️ Note: When using the DAG for plot data selection, not all arguments can be exposed on the top-level of the plot configuration; the define
argument is one of the arguments that is nested in a DAG-specific config entry dag_options
:
my_plot:
dag_options:
define:
exponent: 4
# ...
select:
# ...
transform:
- # ...
Individually adding nodes#
Nodes can be added to TransformationDAG
during initialization; all the examples above are written in that way.
However, transformation nodes can also be added after initialization using the following two methods:
- add_node() adds a single node and returns its reference.
- add_nodes() adds multiple nodes, allowing the define, select, and transform arguments in the same syntax as during initialization. Internally, this parses the arguments and calls add_node(). (A sketch of using these methods follows below.)
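
A sketch of such programmatic node addition; the method names are taken from this page, while the accepted keys follow the full syntax specification further below and should be checked against the API reference:

dag = TransformationDAG(dm=dm)

# Add a single node; the returned reference can be used as an argument
# of subsequently added nodes
ref = dag.add_node(operation="add", args=[3, 4], tag="some_addition")
dag.add_node(operation="mul", args=[ref, 6], tag="the_product")

# Add several nodes at once, using the same syntax as during initialization
dag.add_nodes(transform=[dict(operation="print", args=[ref])])

print(dag.compute())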
Minimal Syntax#
To make the definition a bit less verbose, there is a so-called minimal syntax, which is internally translated into the explicit and verbose one documented above. This can make DAG specification much easier:
select:
some_data: path/to/some_data
more_data: path/to/more_data
transform:
- add: [!dag_tag some_data, !dag_tag more_data]
- increment
- print
- pow: [!dag_prev , 4]
tag: my_result
This DAG will have three custom tags defined: some_data
, more_data
and my_result
.
Computation of the my_result
tag is equivalent to:
my_result = ((some_data + more_data) + 1) ** 4
As can be seen above, the minimal syntax gets rid of the operation, args, and kwargs keys by allowing you to specify a transformation as <operation name>: <args or kwargs> or even as just a string, <operation name>, without further arguments.
With arguments, <operation name>: <args or kwargs>#
When passing a sequence (e.g. [foo, bar]
) the arguments are interpreted as positional arguments; when passing a mapping (e.g. {foo: bar}
), they are treated as keyword arguments.
Hint
In this shorthand notation it is still possible to specify the respective “other” types of arguments using the args
or kwargs
keys.
For example:
transform:
- my_operation: [foo, bar]
kwargs: { some: more, keyword: arguments }
- my_other_operation: {foo: bar}
args: [some, positional, arguments]
Without arguments, <operation name>#
When specifying only the name of the operation as a string (e.g. increment
and print
), it is assumed that the operation accepts only a single positional argument and no other arguments.
That argument is automatically filled with a reference to the result of the previous transformation, i.e.: the result is carried over.
For example, the above transformation with the increment
operation would be translated to:
operation: increment
args: [!dag_prev ]
kwargs: {}
tag: ~
Operation Hooks#
The DAG syntax parser allows attaching additional parsing functions to operations, which can help to supply a more concise syntax.
These so-called operation hooks are described in more detail here.
As an example, the expression
operation can be specified much more conveniently with the use of its hook.
Taking the example from above, the same can be expressed as:
select:
some_data: path/to/some_data
more_data: path/to/more_data
transform:
- expression: (some_data + more_data + 1) ** 4
tag: my_result
In this case, the hook automatically extracts the free symbols (some_data
and more_data
) and translates them to the corresponding DAGTag
objects.
Effectively, it parses the above to:
select:
some_data: path/to/some_data
more_data: path/to/more_data
transform:
- expression: (some_data + more_data + 1) ** 4
kwargs:
symbols:
some_data: !dag_tag some_data
more_data: !dag_tag more_data
tag: my_result
If you want to deactivate a hook, set the ignore_hooks flag for the operation:
operation: some_hooked_operation
args: [foo, bar]
ignore_hooks: true
Warning
Failing operation hooks will emit a logger warning, informing about the error; they do not raise an exception. While this might not lead to a failure during parsing, it might lead to an error during computation, e.g. when you are relying on the hook to have adjusted the operation arguments.
Depending on the operation arguments, there can be cases where the hook will not be able to perform its function because it lacks information that is only available after a computation. In such cases, it’s best to deactivate the hook as described above.
Graph representation and visualization#
The TransformationDAG
has the ability to represent the internally used directed acyclic graph as a networkx.classes.digraph.DiGraph
.
Calling the generate_nx_graph() method adds the Transformation objects to a graph, with the dependencies between these transformations as directed edges.
This can help to better understand the generated DAG and is useful not only for debugging but also for optimization, as it allows showing the associated profiling information.
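
For example, assuming a fully set-up dag object (the edges_as_flow keyword is explained in the hint below):

g = dag.generate_nx_graph()            # -> networkx.classes.digraph.DiGraph
print(g.number_of_nodes(), g.number_of_edges())

# Alternatively, let edges point towards each node's dependencies
g_deps = dag.generate_nx_graph(edges_as_flow=False)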
Hint
It can be configured whether the edges should represent the “flow” of results through the DAG (edges pointing towards the node that requires a certain result) or whether they should point towards a node’s dependency.
By default, generate_nx_graph()
has edges_as_flow
set to True
, thus having edges point in the effective direction of computation.
Visualization#
In addition to generating the graph object, the visualize()
method can generate a visual output of the DAG:
In this example, the node tagged my_result is at the bottom and the arrows come in from the transformations it depends on.
node is tagged at the bottom and the arrows come from the transformations that these operations depend on.
Effectively, calculation starts at the top, with data being read from the dm
node, the associated DataManager
, then following the arrows towards the my_result
node and applying the specified operations like squared
, increment
and so on.
The circles in the background show the status of the computation, green meaning that a node’s result was computed as expected; other colors and their corresponding status are detailed in the legend.
The node status can indicate where in a DAG computation routine an error occurred.
To control this, have a look at the show_node_status
argument and the annotation_kwargs
, where the legend can be controlled.
Note
Operation arguments cannot easily be shown as it would quickly become too cluttered. For that reason, the visualization typically restricts itself to showing the operation name, the result (if computed), and the tag (if set).
See visualize()
for more info.
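
A minimal sketch of invoking the visualization; show_node_status and annotation_kwargs are mentioned above, whereas the out_path keyword for the output file is an assumption:

dag.visualize(
    out_path="~/dag_vis.pdf",          # assumed name of the output path argument
    show_node_status=True,
    # annotation_kwargs=dict(...),     # controls the legend and other annotations
)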
Hint
If using the data transformation framework for plot data selection, visualization is deeply integrated there; see DAG Visualization.
Hint
DAG visualization works much better with pygraphviz installed, because it gives access to more capable layouting algorithms.
Export#
To post-process the DAG data elsewhere, use the standalone export_graph()
function.
Full syntax specification of a single transformation node#
To illustrate the possible arguments for creating a transformation node via add_node()
, the following block contains a full specification of available keys and arguments.
It is a combination of arguments to Transformation
and arguments that are handled by TransformationDAG
, which is aware of the whole DAG.
Note that this is the explicit representation, which is a bit verbose.
Except for operation
, args
, kwargs
and tag
, all entries are set to default values.
operation: some_operation # The name of the operation
args: # Positional arguments
- !dag_tag some_result # Reference to another result
- second_arg
kwargs: # Keyword arguments
one_kwarg: 123
another_kwarg: foobar
salt: ~ # Is included in the hash; set a value here
# if you would like to provoke a cache miss
fallback: ~ # May only be given if ``allow_failure`` is
# also set, in which case it specifies a
# fallback value (or reference) to use
# instead of the operation result.
# All arguments _below_ are NOT taken into account when computing the hash
# of this transformation. Two transformations that differ _only_ in the
# arguments given below are considered equal to each other.
tag: my_result # The tag of this transformation. Optional.
force_compute: ~ # Used to force computation of this node
# without needing to assign a tag.
allow_failure: ~ # Whether to allow this transformation to
# fail during computation or resolution of
# the arguments (i.e.: upstream error).
# Special options are: log, warn, silent
memory_cache: True # If False, will not keep the computed
# result in memory but either re-compute it
# or load it from the file cache.
file_cache: # File cache options
read: # Read-related options
enabled: false # Whether to read from the file cache
always: false # If true, will always read from the file
# cache, regardless of whether the result
# was already stored in the memory cache
# or just computed.
load_options:
unpack: ~ # Whether to unpack the result from the
# dantro container it was loaded into.
# If None, will do so only for numeric
# types (xarray and numpy arrays)
# ... further arguments are passed on to DataManager.load
write: # Write-related options
enabled: false # Whether to write to the file cache
# If writing is enabled, the following options determine whether a
# cache file should actually be written (does not always make sense)
always: false # If true, skips other conditions below and
# ensures that a cache file is created.
# NOTE: This will not *overwrite* an
# existing cache file by default; see the
# ``allow_overwrite`` parameter for that.
allow_overwrite: false # If false, will not write if a cache file
# already exists (even with ``always`` set)
min_size: ~ # If given, the result needs to have at
# least this size (in bytes) for it to be
# written to a cache file.
max_size: ~ # Like min_size, but upper boundary
min_compute_time: ~ # If given, a cache file is only written
# if the computation time of this node on
# its own, i.e. without the computation
# time of the dependencies exceeded this
# value.
min_cumulative_compute_time: ~ # Like min_compute_time, but actually
# taking into account the time it took to
# compute results of the dependencies.
# Options used when storing a result in the cache
storage_options:
raise_on_error: false # Whether to raise if saving failed
attempt_pickling: true # Whether to attempt pickling if saving
# via a specific save function failed
pkl_kwargs: {} # Passed on to pkl.dumps
ignore_groups: true # Whether to attempt storing dantro groups
# ... additional arguments passed on to the specific saving function
Note
This does not reflect any arguments made available by the DAG parser!
Features like the minimal syntax or the operation hooks are handled prior to the initialization of a Transformation
object.
Hint
Often the easiest way to learn is by example. Make sure to check out the Data Transformation Examples page, where you will find practical examples that go beyond what is shown here.
Meta-Operations#
In essence, the transformation framework, as described above, can be used to define a sequence of data operations, just like a sequential program code could do. Now, what if parts of these operations are used multiple times? In a typical computer program, one would define a function to modularize part of the program. The equivalent construct in the data transformation framework is a so-called meta-operation, which can be characterized in the following way:
- It can have input arguments that define which objects it should work on
- It consists of a number of operations that transform the arguments in the desired way
- It has one (and only one) output, the return value
How are meta-operations defined?
Meta-operations can be defined in just the same way as regular transformations are defined, with some additional syntax for defining positional arguments (args
) and keyword arguments (kwargs
).
Let’s look at an example:
# Define meta-operations
meta_operations:
# Compute (x**2 + 1) for a positional input argument x
square_plus_one:
- pow: [!arg 0, 2]
- add: [!dag_prev , 1] # <-- this operation's result is the "return value"
# Select some data and directly compute its mean
select_and_compute_mean:
select:
data:
path: !kwarg to_select
transform:
- .mean: !dag_tag data
This defines two meta-operations: square_plus_one
(with one positional argument) and select_and_compute_mean
(with the to_select
keyword argument).
During initialization of TransformationDAG
, these can be passed using the meta_operations
argument.
How are meta-operations used?
In exactly the same way as all regular data operations: simply define their name as the operation
argument of a transformation.
# Use the meta-operations within the regular data transformation, alongside
# the already-existing data operations
transform:
- select_and_compute_mean:
to_select: path/to/some_data
tag: some_data_mean
- select_and_compute_mean:
to_select: path/to/more_data
tag: more_data_mean
- add: [!dag_tag some_data_mean, !dag_tag more_data_mean]
- square_plus_one: !dag_prev
tag: result
Here, one of the meta-operations is used to compute two mean values from a selection; these are then added together via the regular add
operation; finally, the other meta-operation is applied to that sum, yielding the result.
While the individual meta-operations are not complex in themselves, this illustrates how repeatedly invoked transformations can be modularized.
Note
The examples in these sections use the meta_operations
top-level entry to illustrate the definition of meta-operations.
The transform
and/or select
top-level entries are used to denote how meta-operations can be invoked (in the same way as regular operations).
As a brief summary:
- Meta-operations are defined via the meta_operations argument of TransformationDAG, using the same syntax as for other transformations.
- They can specify positional and keyword arguments and have a return value.
- They can be used for the operation argument of any transformation, same as other available data operations.
- Meta-operations allow modularization and thereby simplify the definition of data transformations.
Hint
To use meta-operations for plot data selection, define them under the dag_options.meta_operations
key of a plot configuration.
Defining meta-operations#
The example above already gave a glimpse into how to define meta-operations.
In many ways, this works exactly the same as defining transformations, e.g. under the transform
argument.
Specifying arguments#
Like Python functions, meta-operations can have two kinds of arguments:
- Positional arguments, defined using the !arg <position> YAML tag
- Keyword arguments, defined using the !kwarg <name> YAML tag
These can be used anywhere inside the meta-operation specification and serve as placeholders for expected arguments. Let’s look at an example:
meta_operations:
my_equation: # [(a + b) * c - d] / e
- add: [!kwarg a, !kwarg b]
- mul: [!dag_prev , !kwarg c]
- sub: [!dag_prev , !kwarg d]
- div: [!dag_prev , !kwarg e]
transform:
- my_equation:
a: 1
b: 10
c: 8
d: 4
e: 2
tag: the_answer
When the meta-operation gets translated into nodes, the corresponding positional and keyword arguments are replaced with the values from the args
and kwargs
of the transformation specification.
Some remarks:
- Positional and keyword arguments can be mixed
- Arguments can be referred to multiple times within a meta-operation definition
- The set of positional arguments, if specified, needs to include all integers between zero and the highest defined !arg <position>.
- Optional arguments and variable positional or keyword arguments are not supported (yet).
Return values#
Meta-operations always have one and only one return value: the last defined transformation.
Hint
To have “multiple” return values, e.g. to return an intermediate result, aggregate objects into a dict
that can then be unpacked outside of the meta-operation.
For an example, see Aggregate return values.
Using select within meta-operations#
The definitions inside meta_operations
can have two possible formats:
meta_operations:
# A -- as list ==> transformations only
a_plus_b_cubed:
- add: [!arg 0, !arg 1]
- pow: [!dag_prev , 3]
# B -- as dict ==> selections _and_ transformations
select_and_square:
select:
data:
path: !kwarg to_select
transform:
- squared: !dag_tag data
transform:
- a_plus_b_cubed: [1, 2]
tag: result1
- select_and_square:
to_select: path/to/some_data
tag: result2
As can be seen above, the dict-based definition supports using the select interface.
Importantly, this supports parametrization: simply use !arg
or !kwarg
inside the select
specification, e.g. to make the path of the to-be-selected object an argument to the meta-operation.
Argument default values#
Meta-operation arguments !arg and !kwarg can also have default values.
These are defined by passing a list of length 2 to the YAML tags (instead of a scalar number for positional arguments or name for keyword arguments).
For instance, if you want an optional keyword argument foo
, define it as:
!kwarg [foo, my_default_value]
Equivalently for positional arguments:
!arg [0, my_default_value]
Let’s look at an example where the my_increment
meta-operation would increment by one per default or by some other value, if desired:
meta_operations:
my_increment:
- add: [!arg 0, !arg [1, 1]]
transform:
- my_increment: [0]
tag: one
- my_increment: !dag_prev
tag: two
- my_increment: [!dag_prev , 8]
tag: ten
The above meta-operation is equivalent to the following Python function with one required positional-only argument and one optional positional-only argument:
def my_increment(x, delta=1, /):
    return x + delta

one = my_increment(0)
two = my_increment(one)
ten = my_increment(two, 8)
For a larger example that is using keyword arguments, see below.
Hint
Default values need not be scalar, they can be anything — as long as they do not contain any Placeholder
objects like tags, references, or other argument definitions.
Warning
To clearly distinguish which arguments are optional and which are required, make sure that an !arg or !kwarg that has a default value specifies that default in all occurrences of the argument within your meta-operation: there should never be a !arg [0, 42] and an !arg 0 in your meta-operation at the same time.
Examples#
prime_multiples#
The following example performs operations on the arguments and then uses internal tags (!dag_tag
) to connect their output to a result.
meta_operations:
prime_multiples:
# Define powers of primes 2, 3, 5, and 7
- pow: [2, !kwarg [base2, 0]]
tag: b2
- pow: [3, !kwarg [base3, 0]]
tag: b3
- pow: [5, !kwarg [base5, 0]]
tag: b5
- pow: [7, !kwarg [base7, 0]]
tag: b7
# Compute their product
- np.: [prod, [!dag_tag b2, !dag_tag b3, !dag_tag b5, !dag_tag b7]]
transform:
- prime_multiples:
base2: 2
base3: 1
# base5: 0
base7: 3
tag: result
As can be seen in the following plot, the meta-operation is unpacked into individual transformation nodes:
Hint
The DAG visualization also shows which operation
originated from which meta-operation (in parentheses below the operation name).
Here, all originate from prime_multiples
.
Aggregate return values#
In this example, a dict
operation is used to return multiple results from a meta-operation.
meta_operations:
# Given some input data (as positional argument), compute a bunch of
# statistical quantities. To return them, aggregate them into a dict.
compute_stats:
# Make sure it's an xarray object
- xr.DataArray: !arg 0
tag: data
# Compute the statistics
- .mean: !dag_tag data
tag: mean
- .std: !dag_tag data
tag: std
- .median: !dag_tag data
tag: median
- .min: !dag_tag data
tag: min
- .max: !dag_tag data
tag: max
- .quantile: [!dag_tag data, 0.25]
tag: q25
- .quantile: [!dag_tag data, 0.75]
tag: q75
# Aggregate into a dict as return value
- dict:
mean: !dag_tag mean
std: !dag_tag std
median: !dag_tag median
min: !dag_tag min
max: !dag_tag max
q25: !dag_tag q25
q75: !dag_tag q75
# Usage example: select some data and get some of the desired statistics
select:
some_data: path/to/some_data
transform:
- compute_stats: !dag_tag some_data
tag: some_stats
- getitem: [!dag_tag some_stats, mean]
tag: some_mean
- getitem: [!dag_tag some_stats, std]
tag: some_std
- getitem: [!dag_tag some_stats, median]
tag: some_median
Note that by aggregating results into an object, the DAG will not be able to discern whether a branch of the compute_stats
meta-operation is actually needed, thus potentially computing more results than required.
In order to avoid computing more nodes than necessary, aggregated return values should be used sparingly; ideally, use them only to return an intermediate result.
This packing and unpacking can also be observed in the DAG plot:
my_gauss#
This example shows how to define a mathematical expression (also see: operation hooks) and how to expose its symbols as arguments of the meta-operation:
meta_operations:
# A meta-operation that defines a gaussian
my_gauss:
- expression: a * exp(- (x - mu)**2 / (2 * sigma**2))
kwargs:
symbols:
x: !kwarg x
a: !kwarg a
mu: !kwarg mu
sigma: !kwarg sigma
transform:
# Compute the Gaussian for two values
- my_gauss:
a: 1.
mu: 0.
sigma: 1.
x: 0.
tag: default_gaussian
- my_gauss:
a: 1.
mu: 23.
sigma: 10.
x: 23.
tag: wide_gaussian_moved
For this case, it makes a lot of sense to use default values for meta-operation arguments, thus reducing the number of keyword arguments that need to be specified:
meta_operations:
# A meta-operation that defines a gaussian
my_gauss:
- expression: a * exp(- (x - mu)**2 / (2 * sigma**2))
kwargs:
symbols:
x: !kwarg x
a: !kwarg [a, 1.]
mu: !kwarg [mu, 0.]
sigma: !kwarg [sigma, 1.]
transform:
# Compute the Gaussian for two values
- my_gauss:
x: 0.
tag: default_gaussian
- my_gauss:
x: 23.
a: 1.
mu: 23.
sigma: 10.
tag: wide_gaussian_moved
Hint
If you do not want to define default arguments, e.g. because you want to control the shared defaults via some YAML-based logic, you can also reduce the number of repeated arguments using YAML anchors and inheritance:
transform:
- my_gauss: &my_gauss_defaults # <-- defines the defaults
a: 1.
mu: 0.
sigma: 1.
x: 0.
tag: default_gaussian
- my_gauss:
<<: *my_gauss_defaults # <-- re-use defaults ...
a: 10. # ... and update with new values
tag: scaled_gaussian
- my_gauss:
<<: *my_gauss_defaults
mu: -42.
tag: moved_gaussian
Remarks & Caveats#
Note the following remarks regarding the definition and use of meta-operations:
- Inside meta-operations, no outside tags except the "special" tags (dag, dm, select_base) can be used. Further inputs should be handled by adding arguments to the meta-operation as described above.
- When using the select syntax in the definition of a meta-operation and aiming to define an argument, note that the long syntax needs to be used:

select:
  # Correct
  some_data:
    path: !kwarg some_data_path

  # WRONG! Will not work.
  other_data: !kwarg other_data_path

- When defining a meta-operation and using an operation that makes use of an operation hook, the tags created by the hook need to be explicitly exposed as arguments, otherwise there will be an Unused tags ... error. To expose them, there are two ways:
  - Use them internally by adding a define, dict, or list operation prior to the operation that uses the hook; then explicitly specify them as arguments there.
  - In the case of the expression operation hook, use the kwargs.symbols entry to directly define them as arguments, as done in the my_gauss example above.
- A meta-operation always adds a so-called "result node", which uses the pass operation to make the result of the meta-operation available. When using a meta-operation, the arguments tag and file_cache (see below) as well as any error handling arguments are added only to this result node. For all other transformation nodes of a meta-operation, the following holds:
  - They may have only internal tags attached
  - They may define their own file_cache behavior; if they do not, the default values for file caching are used.
  - They are free to define their own error handling behavior.
Error Handling#
Operations are not always guaranteed to succeed.
To define more robust operations, some form of error handling is required, akin to try-except
blocks in Python.
In the data transformation framework, the allow_failure option handles failing data operations and allows specifying a fallback value that is used as the result in case the operation failed.
Let’s have a look:
- float: "inf"
- div: [1, 0] # 1 / 0 --> raises ZeroDivisionError
allow_failure: true
fallback: !dag_prev
tag: result
Here, the ZeroDivisionError
is avoided and, instead, the value of the previous node (which defines a float infinity value) is used.
Subsequently, the result
will be the Python floating-point inf
.
Note
The allow_failure
argument also accepts a few string-like values which control the verbosity of the output in case of failure:
- log does the same as True: print a prominent log message that informs about the failed operation and the use of the fallback.
- warn: emits a Python warning
- silent: suppresses the message altogether
Example:
- float: "inf"
- div: [1, 0]
allow_failure: silent # can also be: True, log, warn, False
fallback: !dag_prev
For debugging, make sure to not use silent
.
Hint
The fallback
argument accepts not only scalars, but also sequences or mappings, which in turn may contain !dag_tag
references.
Upstream errors#
Sometimes, an error only becomes relevant in a later operation and it makes sense to defer error handling to that point. The analogy to Python exception handling would be to handle the error not directly where it occurs but in an outside scope.
This is also possible within the error handling framework, because allow_failure
pertains to both the computation of the specified operation as well as the resolution of its arguments.
As the resolution of arguments triggers the computation of dependent nodes (and their dependencies, and so forth), an upstream error may also be caught in a downstream node:
# Example input: assume that this may also be the output from previous
# operations which are used to calculate something else ...
- define: -1.23
tag: some_value
- define: +1
tag: some_other_value
# Perform some potentially problematic operations with these ...
- import_and_call: [math, log10, !dag_tag some_value] # --> ValueError
tag: log10_value
- import: [numpy, pi]
tag: pi
- sub: [!dag_tag some_other_value, 1.]
- div: [!dag_tag pi, !dag_prev ] # --> ZeroDivisionError
tag: pi_over_some_other_value
# ... leading to the result
- add: [!dag_tag log10_value, !dag_tag pi_over_some_other_value]
allow_failure: true
fallback: 42
tag: my_result
In this example, the nodes tagged log10_value
and pi_over_some_other_value
are both problematic but do not specify any error handling.
However, we may only be interested in my_result
, which depends on those two transformation results.
Let’s say, we specified compute_only: [my_result]
.
What would happen in such a case?
1. The transformation tagged my_result is looked up in the DAG.
2. The transformation's arguments are recursively resolved, triggering lookup of the dependencies log10_value and pi_over_some_other_value.
3. The referenced transformations would in turn look up their arguments and finally lead to the application of the problematic operations (div and math.log10), which will fail for the arguments in this example.
4. An error is raised during those operations.
5. The error propagates back to the my_result transformation.
6. With allow_failure: true, the error is caught and the fallback value is used instead.
Warning
The above example only works with compute_only: [my_result]
.
If the problematic tags were to be computed directly, e.g. via compute_only: all
, they would raise an error because they do not specify any error handling themselves.
Note
This example is purely for illustration!
Typically, one would define these operations using numpy and they would not raise exceptions but issue a RuntimeWarning
and use nan
as result.
Error handling within select#
The select
operation may also specify a fallback.
This fallback will only be applied to the getitem
operation which is used to look up the path
from the specified selection base:
select:
some_data: path/to/some_data
mean_data:
path: some/invalid/path # The underlying `getitem` will fail ...
allow_failure: true # ... but is allowed to.
fallback: [[1, 2, 3]] # Instead, this fallback value is used.
transform:
- np.mean # ... which still works for a mean
transform:
- expression: (some_data + mean_data + 1) ** 4
tag: my_result
Hint
The transform
elements can of course again specify their own fallbacks.
Limitations#
There are some limitations to using allow_failure
within select
.
Mainly, specifying a fallback may be difficult in practice because other tags may not be available yet at the time when the DAG is populated with the select arguments.
The tags specified by select are added in alphabetical order and before any transformations from transform are added to the DAG.
Consequently, lookups within one select field are only possible from within select and only for fields that appear earlier in that alphabetical order.
(See this issue for a potential improvement to this behavior.)
Using a tagged reference in the fallback
works in the following example because '_some_fallback_data' < 'mean_data'
:
select:
_some_fallback_data: path/to/some_data
mean_data:
path: some/invalid/path
allow_failure: true
fallback: !dag_tag _some_fallback_data
transform:
- np.mean
transform:
- expression: (mean_data + 1) ** (-0.5)
tag: my_result
Hint
We advise against building overly complex fallback structures within select, e.g. tagged fallbacks which in turn have tagged fallbacks and so forth.
While possible, this may easily become tedious to build and maintain.
If you require more advanced error handling for certain operations, consider wrapping them into your own data operation. See Resolving and applying operations for more information.
The File Cache#
Caching of already computed results is a powerful feature of the TransformationDAG
class.
The idea is that if some specific, computationally expensive transformation already took place previously, it should not be necessary to compute it again.
Background#
To understand the file cache, it’s first necessary to understand the internal handling of transformations.
Within the DAG, each transformation is fully identified by its hash. If the hashes of two transformations are the same it means the operation is the same and all arguments are the same.
All Transformation
objects are stored in an objects
database, which maps a hash to a Python object.
In effect, there is one and only one Transformation
object associated with a certain hash.
Say, a DAG contains two nodes, N1 and N2, with the same hash. Then the object database contains a single transformation T, which is used in place of both nodes N1 and N2. Thus, if the result of one of the nodes is computed, the other should already know the result and not need to re-compute it.
That is what is called the memory cache: once a result is computed, it stays in memory, such that it need not be recomputed again. This is useful not only in the above situation but also when doing DAG traversal during computation.
The file cache is not much different than the memory cache: it aims to make computation results persist to reduce computational resources. With the file cache, the results can persist over multiple invocations of the transformations framework.
Configuration#
Cache directory#
Cache files need to be written in some place.
This can be specified via the cache_dir
argument during the initialization of a TransformationDAG
; see there for details.
By default, the cache directory is called .cache
and is located inside the data directory associated with the DAG’s DataManager.
It is created once it is needed.
Default file cache arguments#
File cache behavior can be configured separately for each Transformation
, as can be seen from the full syntax specification above.
However, it’s often useful to have default values defined that all transformations share.
To do so, pass a dict to the file_cache_defaults
argument.
In the simplest form, it looks like this:
file_cache_defaults:
read: true
write: true
transform:
- # ...
This enables both reading from the cache directory and writing to it.
When passing booleans to read and write, the default behavior is used.
To more specifically configure the behavior, again see the full syntax specification above.
When specifying additional file_cache
arguments within transform
, the values specified there recursively update the ones given by file_cache_defaults
.
Note
The getitem
operations defined via the select
interface always have caching disabled; it makes no sense to cache objects that have been looked up directly from the data tree.
Warning
The file cache arguments are not taken into account for computation of the transformations’ hash. Thus, if there are two transformations with the same hash, only the additional file cache arguments given to the first one are taken into account; the second ones have no effect because the second transformation object is discarded altogether.
Warning
If it is desired to have two transformations with different file cache options, the salt can be used to perturb the hash of one of them and thus enforce that its own file cache arguments are used.
Reading from the file cache#
Generally, the best computation is the one you don’t need to make. If there is no result in memory and reading from cache is enabled, the cache directory is searched for a file that has as its basename the hash of the transformation that is to be computed.
If such a file exists, the DataManager is used to load the data into the data tree and to set the memory cache. (Note that this is Python, i.e. the memory cache is not a copy but a reference to the object in the data tree.)
By default, it is not attempted to read from the cache directory. See above on how to enable it.
Note
When desiring to use the caching feature of the transformation framework, the employed DataManager
needs to be able to load numerical data.
If you are not already using the AllAvailableLoadersMixin
, consider adding NumpyLoaderMixin
, XarrayLoaderMixin
, and PickleLoaderMixin
to your DataManager
specialization.
Hint
Sometimes it can be desired to always read from the file cache, e.g. to make use of the load_options
argument.
In that case, set the following arguments to make sure that a cache file will be written after a computation.
file_cache:
read:
enabled: true
always: true
load_options:
chunks: true
write: true
Note that the computed result may still remain in the memory cache.
See Transformation
on how to not keep it in memory.
Writing to the file cache#
After a computation result was either looked up from the cache or computed, it can be stored in the file cache. By default, writing to the cache is not enabled, either. See above on how to enable it.
When writing to the file cache, several options control whether a transformation's result is actually written to a file. For example, it might make sense to store only results that took a very long time to compute or that are very large.
Once it is decided that a result is to be written to a cache file, the corresponding storage function is invoked. It creates the cache directory, if it does not already exist, and then attempts to save the result object using a set of different storage functions.
There are specific storage functions for numerical data: numpy arrays are stored via the numpy.save
function, which is also used to store NumpyDataContainer
objects.
Another specific storage function takes care of xarray.DataArray
and XrDataContainer
objects.
If there is no specific storage function available, it is attempted to pickle the object.
Note
It is not currently possible to store BaseDataGroup
-derived objects in the file cache.
Remarks#
- The structure of the DAG – a Merkle tree, or: hash tree – ensures that each node's hash depends on all parent nodes' hashes. Thus, all downstream hashes will change if some early operation's arguments are changed.
- The transformation framework cannot distinguish between arguments that are relevant for the result and those that are not; all arguments are taken into account in computing the hash.
- It might not always make sense to read from or write to the cache, depending on how long the computation took, how much data is to be stored and loaded, and how long that takes.
- Dividing up large transformations into many small transformations will increase the possibility of cache hits; however, this also increases the memory footprint of the DAG by potentially requiring more memory for intermediate objects and more read/write operations to the file cache.
- There may never be more than one file in the cache directory that has the same basename (i.e.: hash) as another file. Such situations need to be resolved manually by deleting all but one of the corresponding files.
- There is no harm in just deleting the cache directory, e.g. when it gets too large.