Generic Functions#
Generic Flagging Functions#
Generic flagging functions provide for custom cross-variable quality constraints, directly implemented using the Python API or the Configuration System.
Why?#
In most real world datasets many errors can be explained by the dataset itself. Think of a an active, fan-cooled measurement device: no matter how precise the instrument may work, problems are to be expected when the fan stops working or the power supply drops below a certain threshold. While these dependencies are easy to formalize on a per dataset basis, it is quite challenging to translate them into generic source code. That is why we instrumented SaQC to cope with such situations.
Generic Flagging - Specification#
Generic flagging functions are used in the same manner as their non-generic counterparts. The basic signature looks like that:
flagGeneric(field, func=<expression>, flag=<flag_constant>)
where <expression> is either a callable (Python API) or an expression
composed of the supported constructs
and <flag_constant> is either one of the predefined
flagging constants
(default: BAD) or a valid value of the chosen flagging scheme. Generic flagging functions
are expected to return a collection of boolean values, i.e. one True or False for every
value in field. All other expressions will fail during runtime of SaQC.
Examples#
The following sections show some contrived but realistic examples, highlighting the
potential of flagGeneric. Let’s first generate a
dummy dataset, to lead us through the following code snippets:
from saqc import SaQC
x = np.array([12, 87, 45, 31, 18, 99])
y = np.array([2, 12, 33, 133, 8, 33])
z = np.array([34, 23, 89, 56, 5, 1])
dates = pd.date_range(start="2020-01-01", periods=len(x), freq="D")
data = pd.DataFrame({"x": x, "y": y, "z": z}, index=dates)
qc = SaQC(data)
>>> qc.data
x | y | z |
============== | =============== | ============== |
2020-01-01 12 | 2020-01-01 2 | 2020-01-01 34 |
2020-01-02 87 | 2020-01-02 12 | 2020-01-02 23 |
2020-01-03 45 | 2020-01-03 33 | 2020-01-03 89 |
2020-01-04 31 | 2020-01-04 133 | 2020-01-04 56 |
2020-01-05 18 | 2020-01-05 8 | 2020-01-05 5 |
2020-01-06 99 | 2020-01-06 33 | 2020-01-06 1 |
Simple constraints#
Task: Flag all values of x where x is smaller than 30
qc1 = qc.flagGeneric(field="x", func=lambda x: x < 30)
>>> qc1.flags
x | y | z |
================= | =============== | =============== |
2020-01-01 255.0 | 2020-01-01 -inf | 2020-01-01 -inf |
2020-01-02 -inf | 2020-01-02 -inf | 2020-01-02 -inf |
2020-01-03 -inf | 2020-01-03 -inf | 2020-01-03 -inf |
2020-01-04 -inf | 2020-01-04 -inf | 2020-01-04 -inf |
2020-01-05 255.0 | 2020-01-05 -inf | 2020-01-05 -inf |
2020-01-06 -inf | 2020-01-06 -inf | 2020-01-06 -inf |
varname ; test
#-------;------------------------
x ; flagGeneric(func=x < 30)
As to be expected, the usual comparison operators are supported.
Cross variable constraints#
Task: Flag all values of x where y is larger than 30
qc2 = qc.flagGeneric(field="y", target="x", func=lambda y: y > 30)
>>> qc2.flags
x | y | z |
================= | =============== | =============== |
2020-01-01 -inf | 2020-01-01 -inf | 2020-01-01 -inf |
2020-01-02 -inf | 2020-01-02 -inf | 2020-01-02 -inf |
2020-01-03 255.0 | 2020-01-03 -inf | 2020-01-03 -inf |
2020-01-04 255.0 | 2020-01-04 -inf | 2020-01-04 -inf |
2020-01-05 -inf | 2020-01-05 -inf | 2020-01-05 -inf |
2020-01-06 255.0 | 2020-01-06 -inf | 2020-01-06 -inf |
We introduce another selection parameter, namely target. While field is still used to select
a variable from the dataset, which is subsequently passed to the given function func, target names the
variable to which SaQC writes the produced flags.
varname ; test
#-------;------------------------------------
x ; flagGeneric(field="y", func=y > 30)
Here the value in the first config column acts as the target, while field needs to be given
as function argument. In case field is not explicitly given, the first column acts as both,
field and target.
Multiple cross variable constraints#
Task: Flag all values of x where y is larger than 30 and z is smaller than 50:
In order to pass multiple variables to func, we need to also specify multiple field elements.
Note: to combine boolean expressions using one the available logical operators, they single expressions
need to be put in parentheses.
qc3 = qc.flagGeneric(field=["y", "z"], target="x", func=lambda y, z: (y > 30) & (z < 50))
>>> qc3.flags
x | y | z |
================= | =============== | =============== |
2020-01-01 -inf | 2020-01-01 -inf | 2020-01-01 -inf |
2020-01-02 -inf | 2020-01-02 -inf | 2020-01-02 -inf |
2020-01-03 -inf | 2020-01-03 -inf | 2020-01-03 -inf |
2020-01-04 -inf | 2020-01-04 -inf | 2020-01-04 -inf |
2020-01-05 -inf | 2020-01-05 -inf | 2020-01-05 -inf |
2020-01-06 255.0 | 2020-01-06 -inf | 2020-01-06 -inf |
The mapping
from field to the lambda function parameters is positional and not bound to names. That means
that the function parameters can be named arbitrarily.
varname ; test
#-------;--------------------------------------------------------
x ; flagGeneric(field=["y", "z"], func=(y > 30) & (z < 50))
Here the value in the first config column acts as the target, while field needs to be given
as a function argument. In case field is not explicitly given, the first column acts as both,
field and target.
The mapping from field to the names used in func is positional, i.e. the first value in field
is mapped to the first variable found in func.
Arithmetics#
Task: Flag all values of x, that exceed the arithmetic mean of y and z
qc4 = qc.flagGeneric(field=["x", "y", "z"], target="x", func=lambda x, y, z: x > (y + z)/2)
>>> qc4.flags
x | y | z |
================= | =============== | =============== |
2020-01-01 -inf | 2020-01-01 -inf | 2020-01-01 -inf |
2020-01-02 255.0 | 2020-01-02 -inf | 2020-01-02 -inf |
2020-01-03 -inf | 2020-01-03 -inf | 2020-01-03 -inf |
2020-01-04 -inf | 2020-01-04 -inf | 2020-01-04 -inf |
2020-01-05 255.0 | 2020-01-05 -inf | 2020-01-05 -inf |
2020-01-06 255.0 | 2020-01-06 -inf | 2020-01-06 -inf |
varname ; test #-------;------------------------------------------------------- x ; flagGeneric(field=["x", "y", "z"], func=x > (y + z)/2)
flagGenericsupports the usual arithmetic operators.
Special functions#
Task: Flag all values of x, that exceed 2 standard deviations of z.
qc5 = qc.flagGeneric(field=["x", "z"], target="x", func=lambda x, z: x > np.std(z) * 2)
>>> qc5.flags
x | y | z |
================= | =============== | =============== |
2020-01-01 -inf | 2020-01-01 -inf | 2020-01-01 -inf |
2020-01-02 255.0 | 2020-01-02 -inf | 2020-01-02 -inf |
2020-01-03 -inf | 2020-01-03 -inf | 2020-01-03 -inf |
2020-01-04 -inf | 2020-01-04 -inf | 2020-01-04 -inf |
2020-01-05 -inf | 2020-01-05 -inf | 2020-01-05 -inf |
2020-01-06 255.0 | 2020-01-06 -inf | 2020-01-06 -inf |
The selected variables are passed to func as arguments of type pd.Series, so all functions
accepting such an argument can be used in generic expressions.
varname ; test
#-------;---------------------------------------------------
x ; flagGeneric(field=["x", "z"], func=x > std(z) * 2)
In configurations files, the number of available mathematical functions is more restricted. Instead of basically all functions accepting array-like inputs, only certain built in mathematicalfunctions can be used.
Task: Flag all values of x where y is flagged.
qc6 = (qc
.flagRange(field="y", min=10, max=60)
.flagGeneric(field="y", target="x", func=lambda y: isflagged(y)))
>>> qc6.flags
x | y | z |
================= | ================= | =============== |
2020-01-01 255.0 | 2020-01-01 255.0 | 2020-01-01 -inf |
2020-01-02 -inf | 2020-01-02 -inf | 2020-01-02 -inf |
2020-01-03 -inf | 2020-01-03 -inf | 2020-01-03 -inf |
2020-01-04 255.0 | 2020-01-04 255.0 | 2020-01-04 -inf |
2020-01-05 255.0 | 2020-01-05 255.0 | 2020-01-05 -inf |
2020-01-06 -inf | 2020-01-06 -inf | 2020-01-06 -inf |
varname ; test
#-------;------------------------------------------
y ; flagRange(min=10, max=60)
x ; flagGeneric(field="y", func=isflagged(y))
A real world example#
Let’s consider the following dataset:
from saqc import SaQC
meas = np.array([3.56, 4.7, 0.1, 3.62])
fan = np.array([1, 0, 1, 1])
volt = np.array([12.1, 12.0, 11.5, 12.1])
dates = pd.to_datetime(["2018-06-01 12:00", "2018-06-01 12:10", "2018-06-01 12:20", "2018-06-01 12:30"])
data = pd.DataFrame({"meas": meas, "fan": fan, "volt": volt}, index=dates)
qc = SaQC(data)
>>> qc.data
meas | fan | volt |
========================= | ====================== | ========================= |
2018-06-01 12:00:00 3.56 | 2018-06-01 12:00:00 1 | 2018-06-01 12:00:00 12.1 |
2018-06-01 12:10:00 4.70 | 2018-06-01 12:10:00 0 | 2018-06-01 12:10:00 12.0 |
2018-06-01 12:20:00 0.10 | 2018-06-01 12:20:00 1 | 2018-06-01 12:20:00 11.5 |
2018-06-01 12:30:00 3.62 | 2018-06-01 12:30:00 1 | 2018-06-01 12:30:00 12.1 |
Task: Flag meas where fan equals 0 and volt is lower than 12.0.
Configuration file: There are various options. We can directly implement the condition as follows:
qc7 = qc.flagGeneric(field=["fan", "volt"], target="meas", func=lambda x, y: (x == 0) | (y < 12))
>>> qc7.flags
meas | fan | volt |
========================== | ======================== | ======================== |
2018-06-01 12:00:00 -inf | 2018-06-01 12:00:00 -inf | 2018-06-01 12:00:00 -inf |
2018-06-01 12:10:00 255.0 | 2018-06-01 12:10:00 -inf | 2018-06-01 12:10:00 -inf |
2018-06-01 12:20:00 255.0 | 2018-06-01 12:20:00 -inf | 2018-06-01 12:20:00 -inf |
2018-06-01 12:30:00 -inf | 2018-06-01 12:30:00 -inf | 2018-06-01 12:30:00 -inf |
varname ; test
#-------;---------------------------------------------------------------
meas ; flagGeneric(field=["fan", "volt"], func=(x == 0) | (y < 12.0))
But we could also quality check our independent variables first and than leverage this information later on:
qc8 = (qc
.flagMissing(".*", regex=True)
.flagGeneric(field="fan", func=lambda x: x == 0)
.flagGeneric(field="volt", func=lambda x: x < 12)
.flagGeneric(field=["fan", "volt"], target="meas", func=lambda x, y: isflagged(x) | isflagged(y)))
>>> qc8.flags
meas | fan | volt |
========================== | ========================== | ========================== |
2018-06-01 12:00:00 -inf | 2018-06-01 12:00:00 -inf | 2018-06-01 12:00:00 -inf |
2018-06-01 12:10:00 255.0 | 2018-06-01 12:10:00 255.0 | 2018-06-01 12:10:00 -inf |
2018-06-01 12:20:00 255.0 | 2018-06-01 12:20:00 -inf | 2018-06-01 12:20:00 255.0 |
2018-06-01 12:30:00 -inf | 2018-06-01 12:30:00 -inf | 2018-06-01 12:30:00 -inf |
varname ; test
#-------;--------------------------------------------------------------------------
'.*' ; flagMissing()
fan ; flagGeneric(func=fan == 0)
volt ; flagGeneric(func=volt < 12.0)
meas ; flagGeneric(field=["fan", "volt"], func=isflagged(fan) | isflagged(volt))
Generic Processing#
Generic processing functions provide a way to evaluate mathematical operations and functions on the variables of a given dataset.
Why#
In many real-world use cases, quality control is embedded into a larger data processing pipeline. It is not unusual to even have certain processing requirements as a part of the quality control itself. Generic processing functions make it easy to enrich a dataset through the evaluation of a given expression.
Generic Processing - Specification#
The basic signature looks like that:
processGeneric(field, func=<expression>)
where <expression> is either a callable (Python API) or an expression composed of the
supported constructs (Configuration File).
Example#
Let’s use flagGeneric to calculate the mean value of several
variables in a given dataset. We start with dummy data again:
from saqc import SaQC
x = np.array([12, 87, 45, 31, 18, 99])
y = np.array([2, 12, 33, 133, 8, 33])
z = np.array([34, 23, 89, 56, 5, 1])
dates = pd.date_range(start="2020-01-01", periods=len(x), freq="D")
data = pd.DataFrame({"x": x, "y": y, "z": z}, index=dates)
qc = SaQC(data)
qc1 = qc.processGeneric(
field=["x", "y", "z"],
target="mean",
func=lambda x, y, z: (x + y + z) / 2
)
>>> qc1.data
x | y | z | mean |
============== | =============== | ============== | ================= |
2020-01-01 12 | 2020-01-01 2 | 2020-01-01 34 | 2020-01-01 24.0 |
2020-01-02 87 | 2020-01-02 12 | 2020-01-02 23 | 2020-01-02 61.0 |
2020-01-03 45 | 2020-01-03 33 | 2020-01-03 89 | 2020-01-03 83.5 |
2020-01-04 31 | 2020-01-04 133 | 2020-01-04 56 | 2020-01-04 110.0 |
2020-01-05 18 | 2020-01-05 8 | 2020-01-05 5 | 2020-01-05 15.5 |
2020-01-06 99 | 2020-01-06 33 | 2020-01-06 1 | 2020-01-06 66.5 |
The call to flagGeneric added the new variable mean
to the dataset.
varname ; test
#-------;------------------------------------------------------
mean ; processGeneric(field=["x", "y", "z"], func=(x+y+z)/2)
Supported constructs#
Operators#
Comparison Operators#
The following comparison operators are available:
Operator |
Description |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
Logical operators#
The bitwise operators act as logical operators in comparison chains
Operator |
Description |
|---|---|
|
binary and |
|
binary or |
|
binary xor |
|
binary complement |
Arithmetic Operators#
The following arithmetic operators are supported:
Operator |
Description |
|---|---|
|
addition |
|
subtraction |
|
multiplication |
|
division |
|
exponentiation |
|
modulus |
Functions#
Mathematical Functions#
Name |
Description |
|---|---|
|
absolute values of a variable |
|
maximum value of a variable |
|
minimum value of a variable |
|
mean value of a variable |
|
sum of a variable |
|
standard deviation of a variable |
|
Pointwise absolute Value Function. |
|
Maximum Value Function. Ignores NaN. |
|
Minimum Value Function. Ignores NaN. |
|
Mean Value Function. Ignores NaN. |
|
Summation. Ignores NaN. |
|
Standart Deviation. Ignores NaN. |
|
Pointwise Exponential. |
|
Pointwise Logarithm. |
|
Logarithm, returning NaN for zero input, instead of -inf. |
|
Standart Deviation. Ignores NaN. |
|
Variance. Ignores NaN. |
|
Median. Ignores NaN. |
|
Count Number of values. Ignores NaNs. |
|
Identity. |
|
Returns a Series` diff. |
|
Scales data to [0,1] Interval. |
|
Standardize with Standart Deviation. |
|
Standardize with Median and MAD. |
|
Standardize with Median and inter quantile range. |
Miscellaneous Functions#
Name |
Description |
|---|---|
|
Pointwise, checks if a value is flagged |
|
Returns the length of passed series |

