dantro.utils.data_ops module

This module implements data processing operations for dantro objects.

dantro.utils.data_ops.print_data(data: Any, *, end: str = '\n', fstr: str = None, **fstr_kwargs) → Any[source]

Prints and passes on the data.

The print operation distinguishes between dantro types (for which some more information is shown) and non-dantro types. If a custom format string is given, that format string is always used.

Parameters
  • data (Any) – The data to print

  • end (str, optional) – The end argument to the print call

  • fstr (str, optional) – If given, this format string is used to format the data for printing. The data is passed as the first positional argument to the format string and is thus addressable by {0:}. If the format string is not None, it is always used in place of the custom formatting for dantro objects.

  • **fstr_kwargs – Keyword arguments passed to the format operation.

Returns

the given data

Return type

Any
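
The print-and-pass-on behaviour can be sketched with the standard library alone. The helper name print_and_pass is hypothetical and the branch for dantro types is left out, so this is an illustration of the fstr handling, not dantro's implementation.

```python
from typing import Any


def print_and_pass(data: Any, *, end: str = "\n", fstr: str = None,
                   **fstr_kwargs) -> Any:
    """Hypothetical stand-in: print the data, then return it unchanged."""
    if fstr is not None:
        # The data is the first positional argument, addressable as {0:}
        print(fstr.format(data, **fstr_kwargs), end=end)
    else:
        print(data, end=end)
    return data


# Keyword arguments are forwarded to the format operation
result = print_and_pass(3.14159, fstr="{name} = {0:.2f}", name="pi")
```

Because the data is returned unchanged, such a call can be placed between two other processing steps without affecting them.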

dantro.utils.data_ops.expression(expr: str, *, symbols: dict = None, evaluate: bool = True, transformations: Tuple[Callable] = None, astype: Union[type, str] = <class 'float'>)[source]

Parses and evaluates a symbolic math expression using SymPy.

For parsing, uses sympy’s parse_expr function (see documentation of the parsing module). The symbols are provided as local_dict; the global_dict is not explicitly set and subsequently uses the sympy default value, containing all basic sympy symbols and notations.

Note

The expression given here is not Python code, but symbolic math. You cannot call arbitrary functions, but only those that are imported by from sympy import *.

Hint

When using this expression as part of the Data Transformation Framework, it is attached to a so-called syntax hook that makes it easier to specify the symbols parameter. See here for more information.

Warning

While the expression is symbolic math, be aware that sympy by default interprets the ^ operator as XOR. For exponentiation, use the ** operator or adjust the transformations argument as specified in the sympy documentation.

Warning

The return object of this operation will only contain symbolic sympy objects if astype is None. Otherwise, the type cast will evaluate all symbolic objects to the numerical equivalent specified by the given astype.

Parameters
  • expr (str) – The expression to evaluate

  • symbols (dict, optional) – The symbols to use

  • evaluate (bool, optional) – Controls whether sympy evaluates expr. This may lead to a fully evaluated result, but does not guarantee that no sympy objects are contained in the result. For ensuring a fully numerical result, see the astype argument.

  • transformations (Tuple[Callable], optional) – The transformations argument for sympy’s parse_expr. By default, the sympy standard transformations are performed.

  • astype (Union[type, str], optional) – If given, performs a cast to this data type, fully evaluating all symbolic expressions. Default: Python float.

Raises
  • TypeError – Upon failing astype cast, e.g. due to free symbols remaining in the evaluated expression.

  • ValueError – When parsing of expr failed.

Returns

The result of the evaluated expression.
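
The parsing step and the ^ versus ** distinction can be illustrated by calling sympy directly. This sketch does not go through dantro; it assumes sympy is installed, and the symbol values are made up for illustration.

```python
from sympy.parsing.sympy_parser import (
    convert_xor,
    parse_expr,
    standard_transformations,
)

# Symbols are handed to the parser as local_dict (illustrative values)
symbols = {"x": 3, "y": 2}

# Default transformations: exponentiation is written with **
parsed = parse_expr("x**2 + y", local_dict=symbols)
value = float(parsed)  # cast to a plain Python float, as with astype=float

# With convert_xor added, ^ is read as exponentiation instead of XOR
parsed_xor = parse_expr(
    "x^2 + y",
    local_dict=symbols,
    transformations=standard_transformations + (convert_xor,),
)
value_xor = float(parsed_xor)
```

Both calls evaluate the same polynomial; only the second one accepts the ^ notation for powers.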

dantro.utils.data_ops.generate_lambda(expr: str) → Callable[source]

Generates a lambda from a string. This is useful when working with callables in other operations.

The expr argument needs to be a valid Python lambda expression, see here.

Inside the lambda body, the following names are available for use:

  • A large part of the builtins module

  • Every name from the Python math module, e.g. sin, cos, …

  • These modules (and their long form): np, xr, scipy

Internally, this uses eval but imposes the following restrictions:

  • The following strings may not appear in expr: ;, __.

  • There can be no nested lambda, i.e. the only allowed lambda string is that in the beginning of expr.

  • The dangerous parts from the builtins module are not available.

Parameters

expr (str) – The expression string to evaluate into a lambda.

Returns

The generated Callable.

Return type

Callable

Raises

SyntaxError – Upon failed evaluation of the given expression, invalid expression pattern, or disallowed strings in the lambda body.
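
The restrictions listed above can be sketched as follows. The function make_lambda, the pattern check and the small builtins whitelist are assumptions chosen for illustration; dantro's actual checks and its set of available names may differ.

```python
import math
import re

# Hypothetical whitelist; dantro exposes a larger part of builtins
_ALLOWED_BUILTINS = {"abs": abs, "min": min, "max": max, "sum": sum,
                     "len": len, "round": round}
_MATH_NAMES = {k: v for k, v in vars(math).items() if not k.startswith("_")}


def make_lambda(expr: str):
    """Evaluate a restricted lambda expression string into a callable."""
    if any(bad in expr for bad in (";", "__")):
        raise SyntaxError(f"Disallowed string in expression: {expr!r}")
    # Must start with a lambda and must not contain a nested one
    if not re.match(r"\s*lambda\b[^:]*:", expr) or expr.count("lambda") != 1:
        raise SyntaxError(f"Not a plain lambda expression: {expr!r}")
    # Only whitelisted builtins and the math names are visible to eval
    namespace = {"__builtins__": dict(_ALLOWED_BUILTINS), **_MATH_NAMES}
    return eval(expr, namespace)


f = make_lambda("lambda x: sqrt(x) + 1")
```

String checks of this kind reduce the attack surface of eval but do not make it safe for untrusted input.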

dantro.utils.data_ops.create_mask(data: xarray.core.dataarray.DataArray, operator_name: str, rhs_value: float) → xarray.core.dataarray.DataArray[source]

Given the data, returns a binary mask by applying the following comparison: data <operator> rhs_value.

Parameters
  • data (xr.DataArray) – The data to apply the comparison to. This is the lhs of the comparison.

  • operator_name (str) – The name of the binary operator function as registered in the BOOLEAN_OPERATORS constant.

  • rhs_value (float) – The right-hand-side value

Raises

KeyError – On invalid operator name

Returns

Boolean mask

Return type

xr.DataArray
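
The lookup-then-compare logic can be sketched with numpy and the operator module. BOOLEAN_OPS below is a hypothetical stand-in for the BOOLEAN_OPERATORS constant, whose actual keys are not listed here, and a plain numpy array takes the place of the xr.DataArray.

```python
import operator

import numpy as np

# Hypothetical stand-in for the BOOLEAN_OPERATORS constant
BOOLEAN_OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def make_mask(data, operator_name: str, rhs_value: float):
    """Apply ``data <operator> rhs_value`` element-wise."""
    try:
        comparison = BOOLEAN_OPS[operator_name]
    except KeyError as err:
        raise KeyError(
            f"No operator '{operator_name}'! "
            f"Available: {', '.join(BOOLEAN_OPS)}"
        ) from err
    return comparison(data, rhs_value)


mask = make_mask(np.array([1, 5, 3, 8]), ">", 4)
```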

dantro.utils.data_ops.where(data: xarray.core.dataarray.DataArray, operator_name: str, rhs_value: float, **kwargs) → xarray.core.dataarray.DataArray[source]

Filter elements from the given data according to a condition. Only those elements where the condition is fulfilled remain unmasked.

NOTE This typically leads to a dtype change to float.

Parameters
  • data (xr.DataArray) – The data to mask

  • operator_name (str) – The operator argument used in create_mask()

  • rhs_value (float) – The rhs_value argument used in create_mask()

  • **kwargs – Passed on to data.where attribute call
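
The dtype change mentioned in the note comes from the masked elements being replaced by NaN, which integer dtypes cannot represent. A numpy-only sketch of the effect, using np.where as a stand-in for the xarray where method:

```python
import numpy as np

data = np.array([1, 5, 3, 8])  # integer dtype
mask = data > 4                # condition: data > 4

# Elements where the condition is not fulfilled become NaN,
# which forces the result to a float dtype
filtered = np.where(mask, data, np.nan)
```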

dantro.utils.data_ops.count_unique(data, dims: List[str] = None) → xarray.core.dataarray.DataArray[source]

Applies np.unique to the given data and constructs an xr.DataArray for the results.

NaN values are filtered out.

Parameters
  • data – The data

  • dims (List[str], optional) – The dimensions along which to apply np.unique. The other dimensions will be available after the operation. If not provided, np.unique is applied along all dimensions.
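
For the case where no dims are given, the core of the operation can be sketched with numpy alone: drop NaN values, then count the remaining unique values. The construction of the resulting xr.DataArray is omitted here.

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, np.nan, 3.0, np.nan, 3.0, 3.0])

# Filter out NaN values before counting
valid = data[~np.isnan(data)]
values, counts = np.unique(valid, return_counts=True)
```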

dantro.utils.data_ops.populate_ndarray(objs: Iterable, shape: Tuple[int] = None, dtype: Union[str, type, numpy.dtype] = <class 'float'>, order: str = 'C', out: numpy.ndarray = None, ufunc: Callable = None) → numpy.ndarray[source]

Populates an empty np.ndarray of the given dtype with the given objects by zipping over a new array of the given shape and the sequence of objects.

Parameters
  • objs (Iterable) – The objects to add to the np.ndarray. These objects are added in the order they are given here. Note that their final position inside the resulting array is furthermore determined by the order argument.

  • shape (Tuple[int], optional) – The shape of the new array. Required if no out array is given.

  • dtype (Union[str, type, np.dtype], optional) – dtype of the new array. Ignored if out is given.

  • order (str, optional) – Order of the new array, determines iteration order. Ignored if out is given.

  • out (np.ndarray, optional) – If given, populates this array rather than an empty array.

  • ufunc (Callable, optional) – If given, applies this unary function to each element before storing it in the to-be-returned ndarray.

Returns

The populated out array or the newly created one (if out was not given)

Return type

np.ndarray

Raises
  • TypeError – On missing shape argument if out is not given

  • ValueError – If the number of given objects did not match the array size
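
The zipping approach can be sketched as below. The helper fill_array is hypothetical and simplified: it always creates a new array, iterates in C order only, and does not cover the out, order or ufunc arguments.

```python
from typing import Iterable, Tuple

import numpy as np


def fill_array(objs: Iterable, shape: Tuple[int, ...], dtype=float) -> np.ndarray:
    """Hypothetical helper: place objects into a new array in C order."""
    objs = list(objs)
    out = np.empty(shape, dtype=dtype)
    if len(objs) != out.size:
        raise ValueError(
            f"Got {len(objs)} objects for an array of size {out.size}!"
        )
    # np.ndindex yields the index tuples of the array in C order
    for idx, obj in zip(np.ndindex(*out.shape), objs):
        out[idx] = obj
    return out


arr = fill_array(range(6), (2, 3))
```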

dantro.utils.data_ops.build_object_array(objs: Union[Dict, Sequence], *, dims: Tuple[str] = ('label',), fillna: Any = None) → xarray.core.dataarray.DataArray[source]

Creates a simple labelled multidimensional object array.

It accepts simple iterable types like dictionaries or lists and unpacks them into the array, using the key or index (respectively) as coordinate for the entry. For dict-like entries, multi-dimensional coordinates can be specified by using tuples for keys. Consequently, list-like iterable types (list, tuple etc.) will result in a one-dimensional output array.

Warning

This data operation is built for flexibility, not for speed. It will call the merge() operation for every element in the objs iterable, thus being slow and potentially creating an array with many empty elements. To efficiently populate an n-dimensional object array, use the populate_ndarray() operation instead and build a labelled array from that output.

Parameters
  • objs (Union[Dict, Sequence]) – The objects to populate the object array with. If dict-like, keys are assumed to encode coordinates, which can be of the form coord0 or (coord0, coord1, …), where the tuple-form requires as many coordinates as there are entries in the dims argument. If list- or tuple-like (more exactly: if missing the items attribute) trivial indexing is used and dims needs to be 1D.

  • dims (Tuple[str], optional) – The names of the dimensions of the labelled array.

  • fillna (Any, optional) – The fill value for entries that are not covered by the dimensions specified by objs. Note that this will replace all null values, which includes NaN but also None. This operation is only called if fillna is not None.

Raises

ValueError – If coordinates and/or dims argument for individual entries did not match.
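
How tuple keys translate into coordinates and array positions can be sketched with numpy alone. The helper object_grid is hypothetical; it returns a coordinate dict and a plain object array instead of an xr.DataArray and does not use merge(), so it only mirrors the labelling logic.

```python
from typing import Dict, Sequence, Tuple

import numpy as np


def object_grid(objs: Dict, dims: Sequence[str]) -> Tuple[dict, np.ndarray]:
    """Hypothetical helper: arrange dict values on a grid spanned by keys."""
    # Normalize scalar keys to 1-tuples
    keys = [k if isinstance(k, tuple) else (k,) for k in objs]
    if any(len(k) != len(dims) for k in keys):
        raise ValueError("Number of coordinates does not match dims!")
    # One sorted coordinate list per dimension
    coords = {d: sorted({k[i] for k in keys}) for i, d in enumerate(dims)}
    # Positions not covered by any key stay None
    grid = np.full([len(coords[d]) for d in dims], None, dtype=object)
    for key, obj in zip(keys, objs.values()):
        idx = tuple(coords[d].index(c) for d, c in zip(dims, key))
        grid[idx] = obj
    return coords, grid


coords, grid = object_grid({("a", 0): "x", ("b", 1): "y"}, dims=("row", "col"))
```

The two entries span a 2 x 2 grid, so half of the positions remain empty; this is the sparsity the warning above refers to.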

dantro.utils.data_ops.multi_concat(arrs: numpy.ndarray, *, dims: Sequence[str]) → xarray.core.dataarray.DataArray[source]

Concatenates xr.Dataset or xr.DataArray objects using xr.concat. This function expects the xarray objects to be pre-aligned inside the numpy object array arrs, with the number of dimensions matching the number of concatenation operations desired. The position inside the array carries information on where the objects that are to be concatenated are placed inside the higher dimensional coordinate system.

Through multiple concatenation, the dimensionality of the contained objects is increased by dims, while their dtype can be maintained.

For the sequential application of xr.concat along the outer dimensions, the custom dantro.tools.apply_along_axis() is used.

Parameters
  • arrs (np.ndarray) – The array containing xarray objects which are to be concatenated. Each array dimension should correspond to one of the given dims. For each of the dimensions, the xr.concat operation is applied along the axis, effectively reducing the dimensionality of arrs to a scalar and increasing the dimensionality of the contained xarray objects until they additionally contain the dimensions specified in dims.

  • dims (Sequence[str]) – A sequence of dimension names that is assumed to match the dimension names of the array. During each concatenation operation, the name is passed along to xr.concat where it is used to select the dimension of the content of arrs along which concatenation should occur.

Raises

ValueError – If number of dimension names does not match the number of data dimensions.
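
The idea of concatenating once per axis of the object array can be sketched with plain xr.concat calls. This does not use dantro.tools.apply_along_axis(); the helper cell and the dimension names are made up for illustration, and xarray is assumed to be installed.

```python
import numpy as np
import xarray as xr


def cell(a: int, b: int) -> xr.DataArray:
    """Hypothetical helper: a small array located at coordinates (a, b)."""
    return xr.DataArray(
        np.full((1, 1, 2), 10 * a + b),
        dims=("a", "b", "x"),
        coords={"a": [a], "b": [b]},
    )


# Pre-aligned object array: position (i, j) holds the data for a=i, b=j
arrs = np.empty((2, 3), dtype=object)
for i in range(2):
    for j in range(3):
        arrs[i, j] = cell(i, j)

# First concatenation along the last axis, then along the remaining one
rows = [xr.concat(list(arrs[i, :]), dim="b") for i in range(arrs.shape[0])]
combined = xr.concat(rows, dim="a")
```

Since no missing values are introduced, the integer dtype of the inner arrays is retained.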

dantro.utils.data_ops.merge(arrs: Union[Sequence[Union[xarray.core.dataarray.DataArray, xarray.core.dataset.Dataset]], numpy.ndarray], *, reduce_to_array: bool = False, **merge_kwargs) → Union[xarray.core.dataset.Dataset, xarray.core.dataarray.DataArray][source]

Merges the given sequence of xarray objects into an xr.Dataset.

As a convenience, this also allows passing a numpy object array containing the xarray objects. Furthermore, if the resulting Dataset contains only a single data variable, that variable can be extracted as a DataArray which is then the return value of this operation.
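
A sketch of the merge and the optional reduction to a single array, calling xr.merge directly. The variable name "v" and the explicit join and compat arguments are choices made for this example and are not taken from dantro.

```python
import xarray as xr

left = xr.DataArray([1, 2], dims=("x",), coords={"x": [0, 1]}, name="v")
right = xr.DataArray([3, 4], dims=("x",), coords={"x": [2, 3]}, name="v")

# Outer join: the result covers the union of the x coordinates
ds = xr.merge([left, right], join="outer", compat="no_conflicts")

# With only one data variable, that variable can be extracted
names = list(ds.data_vars)
arr = ds[names[0]] if len(names) == 1 else ds
```

The intermediate alignment fills uncovered positions with NaN, which is why the merged variable ends up with a float dtype.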

dantro.utils.data_ops.expand_dims(d: Union[numpy.ndarray, xarray.core.dataarray.DataArray], *, dim: dict = None, **kwargs) → xarray.core.dataarray.DataArray[source]

Expands the dimensions of the given object.

If the object does not support the expand_dims method, an attempt is made to convert it to an xr.DataArray.

Parameters
  • d (Union[np.ndarray, xr.DataArray]) – The object to expand the dimensions of

  • dim (dict, optional) – Keys specify the dimensions to expand, values can either be an integer specifying the length of the dimension, or a sequence of coordinates.

  • **kwargs – Passed on to expand_dims method

Returns

The input data with expanded dimensions.

Return type

xr.DataArray

dantro.utils.data_ops.expand_object_array(d: xarray.core.dataarray.DataArray, *, shape: Sequence[int] = None, astype: Union[str, type, numpy.dtype] = None, dims: Sequence[str] = None, coords: Union[dict, str] = 'trivial', combination_method: str = 'concat', allow_reshaping_failure: bool = False, **combination_kwargs) → xarray.core.dataarray.DataArray[source]

Expands a labelled object-array that contains array-like objects into a higher-dimensional labelled array.

d is expected to be an array of arrays, i.e. each element of the outer array is an object that itself is an np.ndarray-like object. The shape is the expected shape of each of these inner arrays. Importantly, all these arrays need to have the exact same shape.

Typically, e.g. when loading data from HDF5 files, the inner array will not be labelled but will consist of simple np.ndarrays. The arguments dims and coords are used to label the inner arrays.

This uses multi_concat() for concatenating or merge() for merging the object arrays into a higher-dimensional array, where the latter option allows for missing values.

Todo

Make reshaping and labelling optional if the inner array already is a labelled array. In such cases, the coordinate assignment is already done and all information for combination is already available.

Parameters
  • d (xr.DataArray) – The labelled object-array containing further arrays as elements (which are assumed to be unlabelled).

  • shape (Sequence[int], optional) – Shape of the inner arrays. If not given, the first element is used to determine the shape.

  • astype (Union[str, type, np.dtype], optional) – All inner arrays need to have the same dtype. If this argument is given, the arrays will be coerced to this dtype. For numeric data, float is typically a good fallback. Note that with combination_method == "merge", the choice here might not be respected.

  • dims (Sequence[str], optional) – Dimension names for labelling the inner arrays. This is necessary for proper alignment. The number of dimensions needs to match the shape. If not given, will use inner_dim_0 and so on.

  • coords (Union[dict, str], optional) – Coordinates of the inner arrays. These are necessary to align the inner arrays with each other. With coords = "trivial", trivial coordinates will be assigned to all dimensions. If specifying a dict and giving "trivial" as value, that dimension will be assigned trivial coordinates.

  • combination_method (str, optional) – The combination method to use to combine the object array. For concat, will use dantro’s multi_concat(), which preserves dtype but does not allow missing values. For merge, will use merge(), which allows missing values (masked using np.nan) but leads to the dtype decaying to float.

  • allow_reshaping_failure (bool, optional) – If true, the expansion is not stopped if reshaping to shape fails for an element. This will lead to missing values at the respective coordinates and the combination_method will automatically be changed to merge.

  • **combination_kwargs – Passed on to the selected combination function, multi_concat() or merge().

Returns

A new, higher-dimensional labelled array.

Return type

xr.DataArray

Raises
  • TypeError – If no shape can be extracted from the first element in the input data d

  • ValueError – On bad argument values for dims, shape, coords or combination_method.
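
The reshape-then-combine idea can be illustrated with unlabelled numpy arrays. This sketch leaves out the labelling via dims and coords and uses np.stack instead of multi_concat() or merge(), so it only shows how the dimensionality grows.

```python
import numpy as np

# Outer object array whose elements are flat inner arrays of equal size
outer = np.empty(3, dtype=object)
for i in range(outer.size):
    outer[i] = np.arange(4) + 10 * i

inner_shape = (2, 2)

# Reshape every inner array, then stack them along the outer dimensions
expanded = np.stack(
    [np.asarray(el).reshape(inner_shape) for el in outer.flat]
)
expanded = expanded.reshape(outer.shape + inner_shape)
```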

dantro.utils.data_ops.raise_SkipPlot(cond: bool = True, *, reason: str = '', passthrough: Any = None)[source]

Raises SkipPlot to trigger that a plot is skipped without error, see Skipping Plots.

If cond is False, this will do nothing but return the passthrough.

Parameters
  • cond (bool, optional) – Whether to actually raise the exception

  • reason (str, optional) – The reason for skipping

  • passthrough (Any, optional) – A passthrough value which is returned if cond did not evaluate to True.
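
The conditional raise with passthrough can be sketched as follows. SkipPlotSketch and raise_skip are hypothetical names standing in for dantro's SkipPlot exception and this operation.

```python
from typing import Any


class SkipPlotSketch(Exception):
    """Hypothetical stand-in for dantro's SkipPlot exception."""


def raise_skip(cond: bool = True, *, reason: str = "",
               passthrough: Any = None) -> Any:
    """Raise if cond is truthy, otherwise hand back the passthrough."""
    if cond:
        raise SkipPlotSketch(reason)
    return passthrough


data = [1, 2, 3]
# Skip only if there is no data; otherwise continue with the data itself
checked = raise_skip(len(data) == 0, reason="no data", passthrough=data)
```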

dantro.utils.data_ops.register_operation(*, name: str, func: Callable, skip_existing: bool = False, overwrite_existing: bool = False) → None[source]

Adds an entry to the shared operations registry.

Parameters
  • name (str) – The name of the operation

  • func (Callable) – The callable

  • skip_existing (bool, optional) – Whether to skip registration if the operation name is already registered. This suppresses the ValueError raised on existing operation name.

  • overwrite_existing (bool, optional) – Whether to overwrite a potentially already existing operation of the same name. If given, this takes precedence over skip_existing.

Raises
  • TypeError – On invalid name or non-callable for the func argument

  • ValueError – On already existing operation name and no skipping or overwriting enabled.
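
The interplay of skip_existing and overwrite_existing can be sketched with a plain dict as registry. The names _OPS, register and apply are hypothetical; dantro's shared registry and its error messages may look different.

```python
from typing import Callable, Dict

# Hypothetical registry; stands in for the shared operations registry
_OPS: Dict[str, Callable] = {}


def register(*, name: str, func: Callable, skip_existing: bool = False,
             overwrite_existing: bool = False) -> None:
    if not isinstance(name, str):
        raise TypeError(f"Operation name needs to be a string, got {name!r}")
    if not callable(func):
        raise TypeError(f"Operation '{name}' needs to be a callable")
    if name in _OPS and not overwrite_existing:
        # overwrite_existing takes precedence over skip_existing
        if skip_existing:
            return
        raise ValueError(f"Operation '{name}' already exists!")
    _OPS[name] = func


def apply(op_name: str, *args, **kwargs):
    """Look up the operation by name and call it."""
    return _OPS[op_name](*args, **kwargs)


register(name="double", func=lambda x: 2 * x)
doubled = apply("double", 4)
```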

dantro.utils.data_ops.apply_operation(op_name: str, *op_args, _log_level: int = 5, **op_kwargs) → Any[source]

Applies an operation with the given arguments and returns its result.

Parameters
  • op_name (str) – The name of the operation to carry out; needs to be part of the OPERATIONS database.

  • *op_args – The positional arguments to the operation

  • _log_level (int, optional) – Log level of the log messages created by this function.

  • **op_kwargs – The keyword arguments to the operation

Returns

The result of the operation

Return type

Any

Raises
dantro.utils.data_ops.available_operations(*, match: str = None, n: int = 5) → Sequence[str][source]

Returns all available operation names or a fuzzy-matched subset of them.

Parameters
  • match (str, optional) – If given, fuzzy-matches the names and only returns close matches to this name.

  • n (int, optional) – Number of close matches to return. Passed on to difflib.get_close_matches

Returns

All available operation names or the matched subset.

The sequence is sorted alphabetically.

Return type

Sequence[str]
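
The fuzzy matching relies on difflib.get_close_matches from the standard library. A small sketch with a made-up list of operation names:

```python
import difflib

# Made-up operation names, sorted alphabetically
names = sorted(["add", "div", "expand_dims", "expression", "mul", "sub"])

# Up to n close matches, best match first
close = difflib.get_close_matches("expresion", names, n=5)
best = close[0]
```

A misspelled name thus still leads to the intended operation, which is useful for suggesting alternatives in error messages.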