
SaQC#

class SaQC(data=None, flags=None, scheme='float')[source]#

Bases: FunctionsMixin

Attributes Summary

attrs

Dictionary of global attributes of this dataset.

columns

data

flags

scheme

Methods Summary

align(field, freq[, method, order, overwrite])

Resample data and flags at uniform frequency.

andGroup(field[, group, target, flag])

Combine flags via AND operation.

applyConfig(field, path[, name])

Apply the processing/flagging pipeline represented by a univariate config file to field, or "instantiate" the config file as a SaQC method.

assignChangePointCluster(field, stat_func, ...)

Label data where it changes significantly.

assignKNNScore(field, target[, n, func, ...])

K nearest neighbor scoring.

assignLOF(field, target[, n, freq, ...])

Assign Local Outlier Factor (LOF).

assignRegimeAnomaly(field, cluster_field, spread)

A function to detect values belonging to an anomalous regime, with respect to the modelling regimes of field.

assignUniLOF(field[, n, algorithm, p, ...])

Univariate outlier scoring, based on LOF.

assignZScore(field[, window, norm_func, ...])

Calculate (rolling) Zscores.

calculatePolynomialResiduals(field, window, ...)

Residuals from polynomial fit.

calculateRollingResiduals(field, window[, ...])

Residuals from sliding window fit.

calibratePipeline(field, problems[, name, ...])

Optimize problem pipeline against supervised field.

clearFlags(field, **kwargs)

Assign UNFLAGGED to all periods.

concatFlags(field[, target, method, invert, ...])

Resample a variable's flags and append them to another variable's flags.

copy([deep])

copyField(field, target[, overwrite])

Copy data and flags.

correctDrift(field, maintenance_field, model)

Correct model defined drifts.

correctOffset(field, max_jump, spread, ...)

Correct offsets to normal value level.

correctRegimeAnomaly(field, cluster_field, model)

Regime-wise model fitting.

dropField(field, **kwargs)

Drop field.

fitLowpassFilter(field, cutoff[, nyq, ...])

Filter and smooth data with Butterworth filter.

fitMomentFM(field[, ratio, context, agg, ...])

Moment Foundational Timeseries Model (MomentFM).

fitPolynomial(field, window, order[, ...])

Fit a polynomial model to the data.

flagByClick(field[, max_gap, gui_mode, ...])

Graphical user interface for flags assignment.

flagByScatterLowpass(field, window, thresh)

Flag noisy data.

flagByStray(field[, window, min_periods, ...])

Flag outliers with the STRAY Algorithm.

flagByVariance(field, window, thresh[, ...])

Flag low-variance data.

flagChangePoints(field, stat_func, ...[, ...])

Flag values that represent a system state transition.

flagConstants(field, thresh, window[, ...])

Flag constant data values.

flagDriftFromNorm(field, window, spread[, ...])

Flag deviations from the central moment.

flagDriftFromReference(field, reference, ...)

Flags data that deviates from a reference.

flagDummy(field, **kwargs)

Pass on data and flags.

flagGeneric(field, func[, target, flag])

Apply custom flagging rule.

flagIsolated(field, gap_window, group_window)

Flag groups of data that are surrounded by data gaps.

flagJumps(field, thresh, window[, ...])

Flag jumps and drops in data.

flagLOF(field[, n, thresh, algorithm, p, flag])

Local Outlier Factor.

flagMissing(field[, flag, dfilter])

Deprecated since version 2.7.0.

flagNAN(field[, flag, dfilter])

Flag NaNs in data.

flagOffset(field, window[, tolerance, ...])

Flag offset groups of values.

flagPatternByDTW(field, reference[, ...])

Pattern Recognition with DTW metric.

flagPlateau(field, min_length[, max_length, ...])

Flag offset groups of data.

flagRange(field[, min, max, flag])

Flag values that exceed fixed bounds.

flagRegimeAnomaly(field, cluster_field, spread)

Flag anomalous regimes with respect to the modelling regimes of field.

flagUnflagged(field[, flag])

Assign flag to all UNFLAGGED periods.

flagUniLOF(field[, n, thresh, probability, ...])

Univariate outlier detection, based on LOF.

flagZScore(field[, method, window, thresh, ...])

Scattering (ZScoring) based outlier detection.

forceFlags(field[, flag])

Assign specific flag to all periods.

interpolateByRolling(field, window[, func, ...])

Impute NAN with aggregation of context.

orGroup(field[, group, target, flag])

Combine flags via OR operation.

plot(field[, path, max_gap, mode, history, ...])

Generate plots.

processGeneric(field, func[, target, dfilter])

Apply custom transformation.

propagateFlags(field, window[, method, ...])

Propagate flags along date axis.

reindex(field, index[, method, tolerance, ...])

Resample data at new index.

renameField(field, new_name, **kwargs)

Rename field.

resample(field, freq[, func, method, maxna, ...])

Sample data at uniform sampling rate.

rolling(field, window[, target, func, ...])

Rolling window function application.

selectTime(field, mode[, selection_field, ...])

Apply a mask.

setFlags(field, data[, override, flag])

Assign scheduled flags.

supervise(field, problem_labels[, override, ...])

Supervise data, so that saqc parameter estimation can be run against it.

transferFlags(field, target[, squeeze, ...])

Transfer flags between variables.

transform(field, func[, freq])

Data transformation.

Attributes Documentation

attrs#

Dictionary of global attributes of this dataset.

columns#
data#
flags#
scheme#

Methods Documentation

align(field, freq, method='time', order=2, overwrite=False, **kwargs)#

Resample data and flags at uniform frequency.

Convert a time series to a specified frequency, interpolating or imputing values and flags according to the chosen method. If field is a list of field names, the results will share one index.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • freq (Union[FreqStr, int]) –

    New sampling rate.

    The sampling frequency the data is aligned to. If a list of fields is passed, all fields will be aligned to the same index.

  • method (Literal['nshift', 'bshift', 'fshift', 'linear', 'time', 'index', 'values', 'pad', 'spline', 'polynomial', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'pchip', 'akima', 'cubicspline', 'from_derivatives'] (default: 'time')) –

    Sampling method.

    Determines how and which values are assigned to the new Index. Supported methods include:

    • 'nshift': Shift grid points to the nearest time stamp within +/- 0.5 * freq.

    • 'bshift': Shift grid points to the first succeeding time stamp.

    • 'fshift': Shift grid points to the last preceding time stamp.

    • 'linear', 'time', 'index', 'values': Use numerical values of the index. (Note: internally mapped to 'mshift'.)

    • 'pad': Fill NaNs using existing values (same as 'fshift').

    • 'spline', 'polynomial': Passed to scipy.interpolate.interp1d. Requires specifying order.

    • 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric': Passed to scipy.interpolate.interp1d.

    • 'krogh', 'pchip', 'akima', 'cubicspline': Wrappers around SciPy interpolation methods.

    • 'from_derivatives': Uses scipy.interpolate.BPoly.from_derivatives.

  • order (int>0 (default: 2)) –

    Method order.

    Some methods (e.g., polynomial, spline) additionally require an order or degree to be specified. Ignored otherwise.

  • overwrite (bool (default: False)) –

    Overwrite existing flags.

    If True, existing flags will be cleared.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
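
A minimal usage sketch (not part of the signature documentation above; the variable name 'data' and the parameter values are illustrative):

>>> import pandas as pd
>>> import saqc
>>> idx = pd.DatetimeIndex(['2000-01-01 00:00', '2000-01-01 00:07', '2000-01-01 00:21'])
>>> dat = pd.Series([1.0, 2.0, 3.0], index=idx, name='data')
>>> qc = saqc.SaQC(dat)
>>> qc = qc.align('data', freq='10min', method='time')  # interpolate onto a uniform 10-minute grid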

andGroup(field, group=None, target=None, flag=255.0, **kwargs)#

Combine flags via AND operation.

Flag the variable(s) field at every period at which field is flagged in all of the SaQC objects in group.

See Examples section for examples.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • group (Optional[Sequence[SaQC]] (default: None)) –

    AND operands.

    A collection of SaQC objects. Flag checks are performed on all SaQC objects based on the variables specified in field. Whenever all monitored variables are flagged, the associated timestamps will receive a flag.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Examples

Flag data if the values are above a certain threshold (determined by flagRange()) AND the values are constant for 3 periods (determined by flagConstants()):

>>> dat = pd.Series([1,0,0,0,1,2,3,4,5,5,5,4], name='data', index=pd.date_range('2000', freq='10min', periods=12))
>>> qc = saqc.SaQC(dat)
>>> qc = qc.andGroup('data', group=[qc.flagRange('data', max=4), qc.flagConstants('data', thresh=0, window=3)])
>>> qc.flags['data']
2000-01-01 00:00:00     -inf
2000-01-01 00:10:00     -inf
2000-01-01 00:20:00     -inf
2000-01-01 00:30:00     -inf
2000-01-01 00:40:00     -inf
2000-01-01 00:50:00     -inf
2000-01-01 01:00:00     -inf
2000-01-01 01:10:00     -inf
2000-01-01 01:20:00    255.0
2000-01-01 01:30:00    255.0
2000-01-01 01:40:00    255.0
2000-01-01 01:50:00     -inf
Freq: 10min, dtype: float64

Masking data, so that a test result only gets assigned during daytime (between 6 and 18 o'clock, for example). The daytime condition is generated via flagGeneric():

>>> from saqc.lib.tools import periodicMask
>>> mask_func = lambda x: ~periodicMask(x.index, '06:00:00', '18:00:00', True)
>>> dat = pd.Series(range(100), name='data', index=pd.date_range('2000', freq='4h', periods=100))
>>> qc = saqc.SaQC(dat)
>>> qc = qc.andGroup('data', group=[qc.flagRange('data', max=5), qc.flagGeneric('data', func=mask_func)])
>>> qc.flags['data'].head(20)
2000-01-01 00:00:00     -inf
2000-01-01 04:00:00     -inf
2000-01-01 08:00:00     -inf
2000-01-01 12:00:00     -inf
2000-01-01 16:00:00     -inf
2000-01-01 20:00:00     -inf
2000-01-02 00:00:00     -inf
2000-01-02 04:00:00     -inf
2000-01-02 08:00:00    255.0
2000-01-02 12:00:00    255.0
2000-01-02 16:00:00    255.0
2000-01-02 20:00:00     -inf
2000-01-03 00:00:00     -inf
2000-01-03 04:00:00     -inf
2000-01-03 08:00:00    255.0
2000-01-03 12:00:00    255.0
2000-01-03 16:00:00    255.0
2000-01-03 20:00:00     -inf
2000-01-04 00:00:00     -inf
2000-01-04 04:00:00     -inf
Freq: 4h, dtype: float64
applyConfig(field, path, name=None, **kwargs)#

Apply the processing/flagging pipeline represented by a univariate config file to field, or “instantiate” the config file as a SaQC method.

Univariate config file:

  • depends on only one input field for the generation of all intermediary results/processings

  • the flagging/processing result is entirely assigned to/represented by the final flagging/data status of the input field

  • all configs generated from SaQCProblem chains are univariate configs

Parameters:
  • field (str) – Name of the input variable to process.

  • path (str) – Path to the config file to load.

  • name (str (default: None)) – If given, the process representing the config file will be added to the SaQC methods and will be accessible via "name". In this case, execution of the algorithm on field won't be performed.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
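
A hedged usage sketch; the file 'config.csv' is a hypothetical univariate config file and is not reproduced here:

>>> # qc is assumed to be an existing SaQC object holding a variable 'data'
>>> qc = qc.applyConfig('data', path='config.csv')
>>> # alternatively, register the pipeline as a new SaQC method without executing it
>>> qc = qc.applyConfig('data', path='config.csv', name='myPipeline')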

assignChangePointCluster(field, stat_func, thresh_func, window, min_periods, reduce_window=None, reduce_func=<function ChangepointsMixin.<lambda>>, model_by_resids=False, **kwargs)#

Label data where it changes significantly.

The labels will be stored in data. Unless target is given, the labels will overwrite the data in field. The flags will always be set to UNFLAGGED.

Assigns labels to the data, aiming to reflect the continuous regimes of the processes the data is assumed to be generated by. The regime change point detection is based on a sliding window search.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • stat_func (Callable[[ndarray, ndarray], float]) –

    Aggregation function for rolling twin windows.

    A function that assigns a value to every twin window. The backward-facing window content will be passed as the first array, the forward-facing window content as the second.

  • thresh_func (Callable[[ndarray, ndarray], float]) –

    Threshold function for rolling twin windows.

    A function that determines the value level whose exceedance qualifies a timestamp's stat_func value as denoting a change point.

  • window (OffsetStr | tuple[OffsetStr, OffsetStr]) –

    Size of the moving twin windows.

    This is the number of observations used for calculating the statistic.

    If it is a single frequency offset, it applies for the backward- and the forward-facing window.

    If two offsets are passed (as a tuple), the first defines the size of the backward-facing window, the second the size of the forward-facing window.

  • min_periods (int>=0 | tuple[int>=0, int>=0]) –

    Minimum population required in every window.

    Minimum number of observations in a window required to perform the changepoint test. If it is a tuple of two ints, the first refers to the backward-facing, the second to the forward-facing window.

  • reduce_window (Optional[OffsetStr] (default: None)) –

    Merge adjacent changepoints.

    The sliding window search method is not an exact change point search method; usually not a single change point is detected, but a “region” of change around a change point.

    If reduce_window is given, for every window of size reduce_window only the value at the index returned by reduce_func(x, y) is kept and the others are dropped.

    If reduce_window is None, the reduction window size equals the twin window size the change points have been detected with.

  • reduce_func (Callable[[ndarray, ndarray], float] (default: <function ChangepointsMixin.<lambda>>)) –

    Merge-function for adjacent changepoints.

    A function that must return an index value upon input of two arrays x and y. First input parameter will hold the result from the stat_func evaluation for every reduction window. Second input parameter holds the result from the thresh_func evaluation. The default reduction function just selects the value that maximizes the stat_func.

  • model_by_resids (bool (default: False)) –

    Assign labels or statistics.

    If True, the results of stat_funcs are written, otherwise the regime labels.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
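
A minimal sketch, assuming qc holds a (roughly hourly sampled) variable 'data'; the statistic is the absolute difference of the twin-window means, with a fixed threshold of 1 (all names and values illustrative):

>>> import numpy as np
>>> qc = qc.assignChangePointCluster(
...     'data',
...     stat_func=lambda x, y: np.abs(np.mean(x) - np.mean(y)),  # difference of twin-window means
...     thresh_func=lambda x, y: 1.0,                            # constant change-point threshold
...     window='5h',
...     min_periods=3,
...     target='clusters',                                       # write the labels to a new variable
... )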

assignKNNScore(field, target, n=10, func='sum', freq=inf, min_periods=2, algorithm='ball_tree', metric='minkowski', p=2, **kwargs)#

K nearest neighbor scoring.

Score datapoints by an aggregation of the distances to their k nearest neighbors.

The function is a wrapper around the NearestNeighbors method from Python's sklearn library (see reference [1]).

The steps taken to calculate the scores are as follows:

  1. All the time series given through field are combined into one feature space by an inner join on their datetime indexes. Thus, only samples that share timestamps across all fields will be included in the feature space.

  2. Any datapoint/sample where one or more of the features is invalid (= np.nan) will be excluded.

  3. For every data point, the distance to its n nearest neighbors is calculated by applying the metric metric at degree p onto the feature space. The defaults lead to the euclidean metric being applied. If radius is not None, it sets the upper bound of distance for a neighbor to be considered one of the n nearest neighbors. Furthermore, the freq argument determines which samples can be included in a datapoint's nearest neighbors list, by segmenting the data into chunks of the specified temporal extension and feeding those chunks to the kNN algorithm separately.

  4. For every datapoint, the calculated nearest neighbor distances are aggregated to a score by the function passed to func. The default, sum, simply sums up the distances.

  5. The resulting time series of scores is assigned to the field target.

Parameters:
  • field (SaQCFields) – List of variables names to process.

  • n (int>0 (default: 10)) –

    The number of nearest neighbors.

    The number of nearest neighbors included in every datapoint's scoring calculation.

  • func (Union[Callable[[Series], float], Literal['sum', 'mean', 'median', 'min', 'max', 'last', 'first', 'std', 'var', 'count', 'sem', 'linear', 'time']] (default: 'sum')) –

    Distance aggregation.

    A function that assigns a score to every one-dimensional array containing the distances of a datapoint to its n nearest neighbors.

  • freq (float>=0 | FreqStr (default: inf)) –

    Data partitioning size.

    Determines the segmentation of the data into partitions, onto which the kNN algorithm is applied individually.

    • np.inf: Apply Scoring on whole data set at once

    • x > 0 : Apply scoring on successive data chunks of periods length x

    • Offset String : Apply scoring on successive partitions of temporal extension matching the passed offset string

  • min_periods (int>=0 (default: 2)) –

    Minimum population per partition.

    The minimum number of periods that have to be present in a window for the kNN scoring to be applied. If the number of periods present is below min_periods, the score for the datapoints in that window will be np.nan.

  • algorithm (Literal['ball_tree', 'kd_tree', 'brute', 'auto'] (default: 'ball_tree')) –

    Nearest Neighbors searching algorithm.

    The search algorithm to find each datapoint's k nearest neighbors. The keyword just gets passed on to the underlying sklearn method. See reference [1] for more information on the algorithm.

  • metric (Literal['euclidean', 'l2', 'minkowski', 'p', 'manhattan', 'cityblock', 'l1', 'chebyshev', 'infinity', 'seuclidean', 'mahalanobis', 'hamming', 'canberra', 'braycurtis', 'jaccard', 'dice', 'rogerstanimoto', 'russellrao', 'sokalmichener', 'sokalsneath', 'haversine', 'pyfunc'] (default: 'minkowski')) –

    Distance metric.

    The metric the distances to any datapoint's neighbors are computed with. The default of metric together with the default of p results in the euclidean metric being applied. The keyword just gets passed on to the underlying sklearn method. See reference [1] for more information on the algorithm.

  • p (int>0 (default: 2)) –

    Metrics (minkowski) degree.

    The degree of the metric specified by the parameter metric. The keyword just gets passed on to the underlying sklearn method. See reference [1] for more information on the algorithm.

  • target (SaQCFields | newSaQCFields) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

References

[1] https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
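
A minimal sketch for two variables 'a' and 'b' sharing a common index (names and values illustrative); the aggregated distances are written to the new variable 'kNN_scores':

>>> qc = qc.assignKNNScore(field=['a', 'b'], target='kNN_scores', n=5, func='sum')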

assignLOF(field, target, n=20, freq=inf, min_periods=2, algorithm='ball_tree', p=2, **kwargs)#

Assign Local Outlier Factor (LOF).

Parameters:
  • field (SaQCFields) – List of variables names to process.

  • n (int>0 (default: 20)) –

    Number of nearest neighbors.

    Number of periods to be included into the LOF calculation. Defaults to 20, which is a value found to be suitable in the literature.

  • freq (float>0 | FreqStr (default: inf)) –

    Data partitioning frequency.

    Determines the segmentation of the data into partitions, onto which the kNN algorithm is applied individually.

  • algorithm (Literal['ball_tree', 'kd_tree', 'brute', 'auto'] (default: 'ball_tree')) –

    Nearest neighbors search algorithm.

    Algorithm used for calculating the n-nearest neighbors needed for LOF calculation.

  • p (int>0 (default: 2)) –

    Distance metric degree.

    Degree of the metric (“Minkowski”), according to which the distance to neighbors is determined. The most important values are:

    • 1 - Manhattan metric

    • 2 - Euclidean metric

  • target (SaQCFields | newSaQCFields) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

  • n determines the “locality” of an observation (its n nearest neighbors) and sets the upper limit on the size of outlier clusters (i.e. runs of consecutive outliers). Outlier clusters of size greater than n/2 may not be detected reliably.

  • The larger n, the lower the algorithm's sensitivity to local outliers and to small or singleton outlier points. Higher values greatly increase numerical costs.
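
A minimal sketch (names illustrative), writing multivariate LOF scores for 'a' and 'b' to a new variable:

>>> qc = qc.assignLOF(field=['a', 'b'], target='LOF_scores', n=20)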

assignRegimeAnomaly(field, cluster_field, spread, method='single', metric=<function DriftMixin.<lambda>>, frac=0.5, **kwargs)#

A function to detect values belonging to an anomalous regime, with respect to the modelling regimes of field.

The function changes the values of the regime cluster labels to be negative. “Normality” is determined in terms of a maximum spreading distance that regimes must not exceed with respect to a certain metric and linkage method. In addition, a range of regimes is only considered “normal” if it models more than a fraction frac of the valid samples in field. Note that you must detect the regime change points prior to calling this function (they are expected to be stored in the variable given by cluster_field).

Note that it is possible to perform hypothesis tests for regime equality by passing a p-value calculating function as metric and selecting the linkage method “complete”.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • cluster_field (SaQCColumns) –

    Cluster labels variable.

    Column in data holding the cluster labels for the samples in field (has to share the index of field).

  • spread (float>=0) –

    Agglomeration supremum.

    A threshold denoting the value level up to which clusters are agglomerated.

  • method (Literal['single', 'complete', 'average', 'weighted', 'centroid', 'median', 'ward'] (default: 'single')) –

    Linkage method used.

    The linkage method for hierarchical (agglomerative) clustering of the variables.

  • metric (Callable[[ndarray, ndarray], float] (default: <function DriftMixin.<lambda>>)) –

    Metric of regime distances.

    A metric function for calculating the dissimilarity between 2 regimes. Defaults to the absolute difference in mean.

  • frac (float in [0, 1] (default: 0.5)) –

    Minimum variable portion for normal groups.

    The minimum fraction of samples the “normal” group has to comprise to actually be the normal group. Must be in the closed interval [0, 1]; otherwise a ValueError is raised.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
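
A hedged sketch of the intended workflow: regime labels are assumed to have been derived beforehand (for example with assignChangePointCluster) and stored in a variable 'clusters'; names and values are illustrative:

>>> # 'clusters' holds the regime labels for 'data', indexed like 'data'
>>> qc = qc.assignRegimeAnomaly('data', cluster_field='clusters', spread=0.5)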

assignUniLOF(field, n=20, algorithm='ball_tree', p=1, density='auto', fill_na=True, statistical_extent=1, **kwargs)#

Univariate outlier scoring, based on LOF.

Assign “univariate” Local Outlier Factor (LOF) or “univariate” Local Outlier Probability (LOP) scores.

The function is a wrapper around a usual LOF implementation, aiming to provide an easy-to-use, parameter-minimal outlier scoring function for single variables that does not require prior modelling of the variable. LOF is applied onto a concatenation of the field variable and a “temporal density” (or “penalty”) variable that measures the temporal distance between data points.

See the Notes section for more details on the algorithm.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • n (int>0 (default: 20)) –

    Neighborhood size.

    Number of periods to be included into the LOF calculation. Defaults to 20, which is a value found to be suitable in the literature.

    • n determines the “locality” of an observation (its n nearest neighbors) and sets the upper limit on the size of outlier clusters (i.e. runs of consecutive outliers). Outlier clusters of size greater than n/2 may not be detected reliably.

    • The larger n, the lower the algorithm's sensitivity to local outliers and to small or singleton outlier points. Higher values greatly increase numerical costs.

  • algorithm (Literal['ball_tree', 'kd_tree', 'brute', 'auto'] (default: 'ball_tree')) –

    Nearest-neighbors search algorithm.

    Algorithm used for calculating the n-nearest neighbors needed for LOF calculation.

  • p (int>0 (default: 1)) –

    Distance metric degree.

    Degree of the metric (“Minkowski”), according to which distance to neighbors is determined. Most important values are:

    • 1 - Manhattan metric

    • 2 - Euclidean metric

  • density (Union[Literal['auto'], float>0] (default: 'auto')) –

    Time-axis differential form.

    How to calculate the temporal distance/density for the variable-to-be-flagged.

    • float - introduces linear density with an increment equal to density

    • Callable - calculates the density by applying the function passed onto the variable to be flagged (passed as Series).

  • fill_na (bool (default: True)) –

    Impute NaN values.

    If True, NaNs in the data are filled with a linear interpolation.

  • statistical_extent (float in [0, 1] (default: 1)) –

    Probability of membership.

    Controls the fuzziness of outlier clusters.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

LOP: Kriegel, H.-P.; Kröger, P.; Schubert, E.; Zimek, A. LoOP: Local Outlier Probabilities. Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009, 1649–1652.

Algorithm steps for uniLOF flagging of variable x:

  1. The temporal density dt(x) is calculated according to the density parameter.

  2. LOF (or LOP) scores L(x) are calculated for the concatenation [x, dt(x)]

  3. x is flagged where L(x) exceeds the threshold determined by the parameter thresh.
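
A minimal sketch (names illustrative), writing univariate LOF scores of 'data' to a new variable:

>>> qc = qc.assignUniLOF('data', n=20, target='uniLOF_scores')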

assignZScore(field, window=None, norm_func='std', model_func='mean', center=True, min_periods=None, **kwargs)#

Calculate (rolling) Zscores.

See the Notes section for a detailed overview of the calculation

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • window (Optional[OffsetStr] (default: None)) –

    Scoring window size.

    If None (default), all data points share the same scoring window, which then equals the whole data.

  • model_func (Union[Callable, Literal['sum', 'mean', 'median', 'min', 'max', 'last', 'first', 'std', 'var', 'count', 'sem', 'linear', 'time']] (default: 'mean')) –

    Center moment function.

    Function to calculate the center moment (usually mean or median) in every window.

  • norm_func (Union[Callable, Literal['sum', 'mean', 'median', 'min', 'max', 'last', 'first', 'std', 'var', 'count', 'sem', 'linear', 'time']] (default: 'std')) –

    Scaling function.

    Function to calculate the scaling for every window

  • center (bool (default: True)) –

    Center windows around scored value.

    Whether or not to center the target value in the scoring window. If False, the target value is the last value in the window.

  • min_periods (Optional[int>=0] (default: None)) –

    Minimum Population per window.

    Minimum number of valid measurements in a scoring window, to consider the resulting score valid.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

Steps of calculation:

1. Consider a window \(W\) of successive points \(W = x_{1},...,x_{w}\) containing the value \(x_{k}\) which is to be checked. (The position of \(k\) within the window depends on the selection of the parameter center.)

2. The “moment” \(M\) for the window gets calculated via \(M=\) model_func(\(W\))

3. The “scaling” \(N\) for the window gets calculated via \(N=\) norm_func(\(W\))

4. The “score” \(S\) for the point \(x_{k}\) gets calculated via \(S = (x_{k} - M) / N\)
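
A minimal sketch of a centered rolling z-scoring over 1-hour windows (names and values illustrative):

>>> qc = qc.assignZScore('data', window='1h', model_func='mean', norm_func='std', center=True, target='zscores')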

calculatePolynomialResiduals(field, window, order, min_periods=0, **kwargs)#

Residuals from polynomial fit.

The residuals are calculated by fitting a polynomial of degree order to a data slice of size window, that has x at its center.

Note that calculating the residuals tends to be quite costly, because a function fit is performed for every sample. To improve performance, consider the following possibilities:

In case your data is sampled at an equidistant frequency grid:

(1) If you know your data to have no significant number of missing values, or if you do not want to calculate residuals for windows containing missing values anyway, performance can be increased by setting min_periods=window.

Note that the initial and final window/2 values do not get fitted.

Each residual gets assigned the worst flag present in the interval of the original data.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • window (OffsetStr | int>0) –

    Extension of the fitting window.

    The size of the window you want to use for fitting. If an integer is passed, the size refers to the number of periods for every fitting window. If an offset string is passed, the size refers to the total temporal extension. The window will be centered around the value-to-be-fitted. For regularly sampled time series, the period number will be cast down to an odd number if even.

  • order (int) – Degree of the fitted polynomial.

  • min_periods (int (default: 0)) –

    Minimum population for fitting windows.

    The minimum number of periods that have to be available in every value's fitting surrounding for the polynomial fit to be performed. If there are not enough values, np.nan gets assigned. The default (0) results in fitting regardless of the number of values present (which results in overfitting for too sparse intervals). To automatically set the minimum number of periods to the number of values in an offset-defined window size, pass np.nan.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
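
A minimal sketch (names and values illustrative), fitting a second-order polynomial in centered 1-hour windows and storing the residuals in a new variable:

>>> qc = qc.calculatePolynomialResiduals('data', window='1h', order=2, min_periods=5, target='residuals')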

calculateRollingResiduals(field, window, func='mean', min_periods=0, center=True, **kwargs)#

Residuals from sliding window fit.

Note that the data gets assigned the worst flag present in the original data.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • window (OffsetStr | int>0) –

    Rolling window size.

    If an integer is passed, the size refers to the number of periods for every fitting window. If an offset string is passed, the size refers to the total temporal extension. For regularly sampled time series, the period number will be cast down to an odd number if center=True.

  • func (Union[Callable[[Series], ndarray], Literal['sum', 'mean', 'median', 'min', 'max', 'last', 'first', 'std', 'var', 'count', 'sem', 'linear', 'time']] (default: 'mean')) –

    Aggregation function.

    Function that aggregates values from any rolling window.

  • min_periods (int>=0 (default: 0)) –

    Minimum population in rolling window.

    If a window has fewer than min_periods valid (non-NaN) values, no aggregation is calculated.

  • center (bool (default: True)) – Assign aggregation result to window center.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
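
A minimal sketch (names and values illustrative), computing residuals against a centered rolling mean:

>>> qc = qc.calculateRollingResiduals('data', window='1h', func='mean', center=True, target='residuals')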

calibratePipeline(field, problems, name=None, problem_labels=None, pop_size=100, termination=None, log_pop=None, log_config=None, log_path=None, verbose=True, **kwargs)#

Optimize problem pipeline against supervised field.

Parameters:
  • field (str) – Name of the input variable to process.

  • problems (list[str]) – Problem Statement of the Pipeline. Definition of the pipeline in terms of a list of (possibly merged) Problems.

  • name (str (default: None)) – Name of the pipeline. Sets the name through which the resulting pipeline is accessible as a new SaQC method.

  • problem_labels (list[str] (default: None)) – Target (column) labels for the pipeline. If None (default), all flags get squashed to a merged target (possibly to be iterated over by a sequential pipeline). If [], fall back to supervision. If a list whose length matches the length of the problems list is passed, the problems will be optimised against it sequentially. If the length of the labels list is exactly 1 (or None), all problems will be fit against this single label sequentially. If a label is listed that is not present in the history, GUI assignment is initialised.

  • log_pop (bool (default: None)) – Whether to log the whole population generated during training to log_path. Defaults to True as long as a valid log_path is given.

  • log_config (bool (default: None)) – Whether to log the optimized parameter set as a config file to log_path. Defaults to True as long as a valid log_path is given.

  • log_path (str (default: None)) – Path to the logs (folder path). If not given (None), no logging happens. If the path already exists, its content will be overwritten.

  • pop_size (int (default: 100)) – Population size to optimize with.

  • termination (tuple | int (default: None)) – Determines termination of the optimisation. If None (default), fall back to the pymoo defaults (computationally exhaustive, but most likely not terminating too early). If an integer, it is interpreted simply as the maximal number of evaluations.

  • verbose (bool (default: True)) – Logging verbosity; controls whether progress is reported while waiting for results.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

Either name or both log_path and log_config should be assigned, since otherwise the optimised pipeline is lost after the run.

clearFlags(field, **kwargs)#

Assign UNFLAGGED to all periods.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

Function ignores dfilter keyword.

concatFlags(field, target=None, method='auto', invert=True, freq=None, drop=False, squeeze=False, override=False, **kwargs)#

Resample a variable's flags and append them to another variable's flags.

Project the flags/history of field to target and adjust them to the frequency grid of target by 'undoing' former interpolation, shifting or resampling operations.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • method (Literal['fagg', 'bagg', 'nagg', 'fshift', 'bshift', 'nshift', 'sshift', 'mshift', 'match', 'auto', 'linear', 'pad'] (default: 'auto')) –

    Aggregation method.

    Method to project the flags of field to the flags to target:

    • 'auto': invert the last alignment/resampling operation (that is not already inverted)

    • 'nagg': project a flag of field to all timestamps of target within the range +/- freq/2.

    • 'bagg': project a flag of field to all preceding timestamps of target within the range freq

    • 'fagg': project a flag of field to all succeeding timestamps of target within the range freq

    • 'interpolation' - project a flag of field to all timestamps of target within the range +/- freq

    • 'sshift' - same as interpolation

    • 'nshift' - project a flag of field to the nearest timestamps in target within the range +/- freq/2

    • 'bshift' - project a flag of field to the nearest preceding timestamps in target

    • 'fshift' - project a flag of field to the nearest succeeding timestamps in target

    • 'match' - project a flag of field to all identical timestamps of target

  • invert (bool (default: True)) –

    Apply inverse of selected method.

    If True, not the actual method is applied but its inverse.

  • freq (UnionType[FreqStr, Timedelta, None] (default: None)) –

    Reindexing scope.

    Projection range. If None the sampling frequency of field is used.

  • drop (bool (default: False)) –

    Drop field.

    Remove field if True

  • squeeze (bool (default: False)) –

    Squash history.

    Squeeze the history into a single column if True; function-specific flag information is lost.

  • override (bool (default: False)) – Override existing flags.

  • target (SaQCFields | newSaQCFields) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

To just use the appropriate inversion with regard to a certain method, set the invert parameter to True and pass the method you want to invert.

To backtrack a previous resampling, shifting or interpolation operation automatically, set method='auto'.
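
A hedged sketch of the round trip described above (names and values illustrative): align a variable onto a regular grid, flag the aligned copy, then project the flags back onto the original variable by inverting the alignment:

>>> qc = qc.align('data', target='data_aligned', freq='10min', method='time')
>>> qc = qc.flagRange('data_aligned', max=100)
>>> qc = qc.concatFlags('data_aligned', target='data', method='auto', squeeze=True)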

copy(deep=True)[source]#
copyField(field, target, overwrite=False, **kwargs)#

Copy data and flags.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • overwrite (bool (default: False)) – Overwrite target.

  • target (SaQCFields | newSaQCFields) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
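
A minimal sketch (names illustrative):

>>> qc = qc.copyField('data', target='data_backup')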

correctDrift(field, maintenance_field, model, cal_range=5, **kwargs)#

Correct model defined drifts.

See the Notes section for an overview over the correction algorithm.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • maintenance_field (SaQCColumns) –

    Support-points variable.

    The data is expected to have the following form: the index of the series represents the beginning of a maintenance event, whereas the values represent its end.

  • model (Union[CurveFitter, Literal['linear', 'exponential']]) –

    Correction model.

    A model function describing the drift behavior that is to be corrected. Either use the built-in exponential or linear drift model by passing a string, or pass a custom callable. The model function must always contain the keyword parameters 'origin' and 'target'. The first parameter must always be the one by which the data is passed to the model. After the data parameter, an arbitrary number of model calibration arguments may occur in the signature. See the Notes section for an extensive description.

  • cal_range (int>=0 (default: 5)) –

    Calibration range.

    Number of values to calculate the mean of, for obtaining the value level directly after and directly before a maintenance event. Needed for shift calibration.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

It is assumed that between support points there is a drift effect shifting the measurements in a way that can be described by a model function M(t, p, origin, target) (with 0 <= t <= 1, p being a parameter set, and origin, target being floats).

Note that it is possible for the model to have no free parameters p at all (mainly for linear drift).

Directly after the last support point (t=0), the drift model should evaluate to the origin calibration level (origin), and directly before the next support point (t=1), it should evaluate to the target calibration level (target):

M(0, p, origin, target) = origin
M(1, p, origin, target) = target

The model is then fitted to every data chunk in between support points by optimizing the parameters p, thus obtaining the optimal parameter set P.

The new values at t are computed via:

new_vals(t) = old_vals(t) + M(t, P, origin, target) - M_drift(t, P, origin, new_target)

where new_target represents the value level immediately after the next support point.

Examples

Some examples of meaningful drift models.

Linear drift model (no free parameters):

>>> Model = lambda t, origin, target: origin + t*target

Exponential drift model (exponential rise):

>>> expFunc = lambda t, a, b, c: a + b * (np.exp(c * t) - 1)
>>> Model = lambda t, c, origin, target: expFunc(t, origin, (target - origin) / (np.exp(abs(c)) - 1), abs(c))

The exponential and linear drift models are part of the ts_operators library, under the names expDriftModel and linearDriftModel.
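
A minimal usage sketch, assuming a SaQC object qc that already holds a measured variable "conductivity" and a maintenance-event series "maintenance" (both variable names are hypothetical); the built-in exponential drift model is selected by its string name:

>>> qc = qc.correctDrift("conductivity", maintenance_field="maintenance", model="exponential", cal_range=5)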

correctOffset(field, max_jump, spread, window, min_periods, tolerance=None, **kwargs)#

Correct offsets to normal value level.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • max_jump (float>=0) –

    Separating threshold for offsets.

    When searching for changepoints in mean - this is the threshold a mean difference in the sliding window search must exceed to trigger changepoint detection.

  • spread (float>=0) –

    Divergence threshold for offsets.

    Threshold denoting the maximum absolute difference by which the means of regimes may differ and still form the “normal group” of values.

  • window (OffsetStr) –

    Context size for mean value levels.

    Size of the adjacent windows that are used to search for the mean changepoints.

  • min_periods (int>=0) –

    Minimum population size for windows.

    Minimum number of periods a search window has to contain, for the result of the changepoint detection to be considered valid.

  • tolerance (Optional[OffsetStr] (default: None)) –

    Neglected data chunks at window bounds.

    If an offset string is passed, a data chunk of that length right after the start and right before the end of any regime is ignored when calculating the regime's mean for data correction. This accounts for the unreliability of data near the changepoints of regimes.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
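
A minimal usage sketch, assuming a SaQC object qc with a variable "level" (hypothetical name); threshold and window values are illustrative only:

>>> qc = qc.correctOffset("level", max_jump=5.0, spread=1.0, window="12h", min_periods=10, tolerance="1h")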

correctRegimeAnomaly(field, cluster_field, model, tolerance=None, epoch=False, **kwargs)#

Regime-wise model fitting.

The function fits the passed model to the different regimes in data[field] and tries to correct those values that are assigned a negative label by data[cluster_field].

Currently, the only correction mode supported is the “parameter propagation.”

This means, any regime \(z\), labeled negatively and being modeled by the parameters p, gets corrected via:

\(z_{correct} = z + (m(p^*) - m(p))\),

where \(p^*\) denotes the parameter set belonging to the fit of the nearest not-negatively labeled cluster.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • cluster_field (SaQCColumns) –

    Cluster labels variable.

    A string denoting the field in data, holding the cluster label for the data you want to correct.

  • model (CurveFitter) –

    Model function.

    The model function to be fitted to the regimes. It must be a function of the form \(f(x, *p)\), where \(x\) is the numpy.array holding the independent variable and \(p\) are the model parameters that are to be obtained by fitting. Depending on the epoch parameter, the independent variable x will either be the timestamps of every regime transformed to seconds from epoch, or just seconds counting from the regime's start.

  • tolerance (Optional[OffsetStr] (default: None)) –

    Ignored window of initial and final values in chunks.

    If an offset string is passed, a data chunk of length offset right at the start and right at the end is ignored when fitting the model. This is to account for the unreliability of data near the changepoints of regimes. Defaults to None.

  • epoch (bool (default: False)) –

    Use epoch or seconds.

    If True, use “seconds from epoch” as x input to the model func, instead of “seconds from regime start”.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
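
A minimal usage sketch, assuming a SaQC object qc, a data variable "level" and a cluster-label variable "level_clusters" (both hypothetical names), together with a simple exponential model of the required form f(x, *p):

>>> import numpy as np
>>> expModel = lambda x, a, b, c: a + b * np.exp(c * x)
>>> qc = qc.correctRegimeAnomaly("level", cluster_field="level_clusters", model=expModel, tolerance="1h")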

dropField(field, **kwargs)#

Drop field.

Removes data and flags represented by field from the variables.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

fitLowpassFilter(field, cutoff, nyq=0.5, filter_order=2, fill_method='linear', **kwargs)#

Filter and smooth data with Butterworth filter.

Derive a smoothed version of the data by cutting off frequencies of its spectral representation that exceed a cutoff frequency.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • cutoff (float>=0 | FreqStr) –

    The cutoff-frequency.

    Has to be either an offset freq string, or be expressed in multiples of the sampling rate.

  • nyq (float>=0 (default: 0.5)) –

    The Nyquist frequency.

    Expressed in multiples of the sampling rate.

  • fill_method (Literal['linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'spline', 'barycentric', 'polynomial'] (default: 'linear')) –

    Fill method applied pre-filtering.

    Since Butterworth filtering cannot handle np.nan values or irregularly sampled data, an imputation method for gaps should be assigned here. See the documentation of the pandas.Series.interpolate method for details on the methods associated with the different keywords.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

The data is expected to be regularly sampled.
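
A minimal usage sketch, assuming a SaQC object qc with a regularly sampled variable "temperature" (hypothetical name); the smoothed series is written to a new variable via target:

>>> qc = qc.fitLowpassFilter("temperature", target="temperature_smooth", cutoff="1D", fill_method="linear")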

fitMomentFM(field, ratio=4, context=512, agg='mean', model_spec=None, **kwargs)#

Moment Foundational Timeseries Model (MomentFM).

The function applies MomentFM [1] in its reconstruction mode on a window of size context, striding through the data with step size context/ratio.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • ratio (int (default: 4)) –

    Sample size for value reconstruction.

    The number of samples generated for any value's reconstruction. Must be a divisor of context. Effectively controls the stride width of the reconstruction window through the data.

  • context (int (default: 512)) –

    Reconstruction context.

    Size of the context window with regard to which any value is reconstructed.

  • agg (Literal['center', 'mean', 'median', 'std'] (default: 'mean')) –

    Sample aggregation method.

    How to aggregate the different reconstructions of the same value.

    • ‘center’: use the value that was reconstructed in a window centered around the original value

    • ‘mean’: assign the mean over all reconstructed values

    • ‘median’: assign the median over all reconstructed values

    • ‘std’: assign the standard deviation over all reconstructed values

  • model_spec (Optional[dict] (default: None)) –

    Model specification. Dictionary with the fields:

    • pretrained_model_name_or_path

    • revision

    Defaults to the global parameter DEFAULT_MOMENT = dict(pretrained_model_name_or_path="AutonLab/MOMENT-1-large", revision="main").

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Examples

[Figure: example of a MomentFM reconstruction (fitFMpic.png)]

Notes

[1] https://arxiv.org/abs/2402.03885

[2] moment-timeseries-foundation-model/moment
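
A minimal usage sketch, assuming a SaQC object qc with a variable "temperature" (hypothetical name); the model_spec shown simply restates the documented default:

>>> spec = dict(pretrained_model_name_or_path="AutonLab/MOMENT-1-large", revision="main")
>>> qc = qc.fitMomentFM("temperature", ratio=4, context=512, agg="median", model_spec=spec, target="temperature_rec")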

fitPolynomial(field, window, order, min_periods=0, **kwargs)#

Fit a polynomial model to the data.

The fit is calculated by fitting a polynomial of degree order to a data slice of extension window, centered around each timestamp.

For regularly sampled data:

  • If missing values are rare or residuals for windows with missing values are not needed, performance can be increased by setting min_periods=window.

  • The initial and final window//2 timestamps do not get fitted.

  • Each residual is assigned the worst flag present in the corresponding interval of the original data.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • window (OffsetStr | int>=0) –

    Extension of the fitting window.

    If an integer is passed, it represents the number of timestamps in each window. If an offset string is passed, it represents the window’s temporal extent. The window is centered around the timestamp being fitted. For uniformly sampled data, an odd number of timestamps always constitutes a window (the size is reduced by 1 if an even total is passed).

  • order (int>=1) – Degree of the fitted polynomial.

  • min_periods (int>=0 (default: 0)) –

    Minimum population for fitting windows.

    Windows with fewer timestamps will result in NaN valued smoothing points. Passing 0 disables this check and may result in overfitting for sparse windows.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
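
A minimal usage sketch, assuming a SaQC object qc with a variable "temperature" (hypothetical name); a centered window of 11 timestamps and a quadratic fit are illustrative choices:

>>> qc = qc.fitPolynomial("temperature", target="temperature_fit", window=11, order=2, min_periods=5)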

flagByClick(field, max_gap=None, gui_mode='GUI', selection_marker_kwargs=None, dfilter=255.0, **kwargs)#

Graphical user interface for flags assignment.

Pop up GUI for adding or removing flags by selection of points in the data plot.

  • Left-click and drag the selection area over the points you want to add to the selection.

  • Right-click and drag the selection area over the points you want to remove from the selection.

  • Press ‘shift’ to switch between rectangle and span selector.

  • Press ‘enter’ or click “Assign Flags” to assign flags to the selected points and end the session.

  • Press ‘escape’ or click “Discard” to end the session without assigning flags to the selection.

  • Activate the sliders attached to each axis to bind the respective variable. When using the span selector, points from all bound variables will be added synchronously.

Note that you can only mark already flagged values if dfilter is set accordingly.

Note that you can use flagByClick to “unflag” already flagged values: set dfilter above the flag level you want to “unset”, and set flag to a flagging level associated with your “unflagged” level.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • max_gap (Optional[OffsetStr] (default: None)) –

    Limit for plotting gaps.

    If None, all data points will be connected, resulting in long linear lines, in case of large data gaps. NaN values will be removed before plotting. If an offset string is passed, only points that have a distance below max_gap are connected via the plotting line.

  • gui_mode (Literal['GUI', 'overlay'] (default: 'GUI')) –

    Mode of the gui.

    • "GUI" (default), spawns TK based pop-up GUI, enabling scrolling and binding for subplots

    • "overlay", spawns matplotlib based pop-up GUI. May be less conflicting, but does not support scrolling or binding.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

flagByScatterLowpass(field, window, thresh, func='std', sub_window=None, sub_thresh=None, min_periods=None, flag=255.0, **kwargs)#

Flag noisy data.

Breaks up the data into chunks and flags those chunks if the data therein is too scattered. See the Notes section for algorithm details.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • func (Union[Literal['std', 'var', 'mad'], Callable[[ndarray, Series], float]] (default: 'std')) –

    Scatter statistic.

    A function that assigns each data chunk its scattering.

    • "std" — standard deviation

    • "var" — variance

    • "mad" — median absolute deviation

    • Callable — custom function mapping 1D arrays to scalars

  • window (OffsetStr | Timedelta) –

    Scatter context size.

    The extension of the window from which the scattering (usually variance) is computed for each period.

  • thresh (float>=0) –

    Scattering upper bound.

    If the scatter statistic obtained from a window of size window exceeds thresh, the value centered in the window is flagged.

  • sub_window (UnionType[OffsetStr, Timedelta, None] (default: None)) –

    Size of partitions of the scatter context.

    The window determining the context for the scatter statistic calculation is divided into disjoint sub-windows of size sub_window, and the scattering of each sub-window is tested against sub_thresh in order to finally trigger flagging.

  • sub_thresh (Optional[float>=0] (default: None)) –

    Scattering upper bound on sub window.

    Threshold that the statistic of every sub-chunk is checked against: func(sub_chunk) > sub_thresh.

  • min_periods (Optional[int>=0] (default: None)) –

    Minimum window population.

    Ignored if window is an integer.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

Chunks of length window are flagged if:

  1. They exceed thresh according to the function func.

  2. All (possibly overlapping) sub-chunks of length sub_window exceed sub_thresh according to the same function.
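
A minimal usage sketch, assuming a SaQC object qc with a variable "turbidity" (hypothetical name); threshold and window values are illustrative only:

>>> qc = qc.flagByScatterLowpass("turbidity", window="1h", thresh=2.0, func="std", sub_window="15min", sub_thresh=2.0, min_periods=10)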

flagByStray(field, window=None, min_periods=11, iter_start=0.5, alpha=0.05, flag=255.0, **kwargs)#

Flag outliers with the STRAY Algorithm.

For more details about the algorithm please refer to [1].

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • window (UnionType[OffsetStr, int>=1, None] (default: None)) –

    Window size for STRAY.

    Determines the segmentation of the data into partitions, onto which the STRAY algorithm is applied individually.

    • None: apply scoring on the whole data set at once.

    • int: apply scoring on successive data chunks of periods with the given length. Must be greater than 0.

    • Offset string: apply scoring on successive partitions of temporal extension matching the passed offset string.

  • min_periods (int>=1 (default: 11)) –

    Minimum population per window.

    Minimum number of periods per partition that have to be present for a valid outlier detection to be made in this partition

  • iter_start (float in [0, 1] (default: 0.5)) –

    Portion of normal data.

    Determines which fraction of the data is considered “normal”. 0.5 results in the STRAY algorithm searching only the upper 50% of the scores for the cut-off point. (See the References section for more information.)

  • alpha (float in [0, 1] (default: 0.05)) –

    Significance level.

    Level of significance by which it is tested whether a score might be drawn from another distribution than the majority of the data.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

References

[1] Priyanga Dilini Talagala, Rob J. Hyndman & Kate Smith-Miles (2021):

Anomaly Detection in High-Dimensional Data, Journal of Computational and Graphical Statistics, 30:2, 360-374, DOI: 10.1080/10618600.2020.1807997
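
A minimal usage sketch, assuming a SaQC object qc with a variable "sap_flow" (hypothetical name); the data is scored partition-wise on daily chunks:

>>> qc = qc.flagByStray("sap_flow", window="1D", min_periods=20, alpha=0.05)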

flagByVariance(field, window, thresh, maxna=None, maxna_group=None, flag=255.0, **kwargs)#

Flag low-variance data.

Flags plateaus of constant data if the variance in a rolling window does not exceed a certain threshold.

Any interval of values y(t),..y(t+n) is flagged, if:

  1. n > window

  2. variance(y(t), …, y(t+n)) < thresh

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • window (OffsetStr) –

    Size of the context window for variance.

    Size of each rolling window. Since an offset string is expected, every window covers a fixed time period, and the number of observations per window follows from the observations included in that period.

  • thresh (float>=0) – Maximum total variance allowed per window.

  • maxna (Optional[int>=0] (default: None)) –

    Maximum number of NaNs allowed in window.

    If more NaNs are present, the window is underpopulated and won't trigger any flagging.

  • maxna_group (Optional[int>=0] (default: None)) –

    Maximum number of consecutive NaNs allowed in window.

    If more consecutive NaNs are present, the window is underpopulated and won't trigger any flagging.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
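
A minimal usage sketch, assuming a SaQC object qc with a variable "soil_moisture" (hypothetical name); threshold and window values are illustrative only:

>>> qc = qc.flagByVariance("soil_moisture", window="12h", thresh=0.0005, maxna=2)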

flagChangePoints(field, stat_func, thresh_func, window, min_periods, reduce_window=None, reduce_func=<function ChangepointsMixin.<lambda>>, flag=255.0, **kwargs)#

Flag values that represent a system state transition.

Flag data points, where the parametrization of the assumed process generating this data, significantly changes.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • stat_func (Callable[[ndarray, ndarray], float]) – Aggregation function for rolling twin windows. A function that assigns a value to every twin window. The backward-facing window content will be passed as the first array, the forward-facing window content as the second.

  • thresh_func (Callable[[ndarray, ndarray], float]) – Threshold function for rolling twin windows. A function that determines the value level whose exceedance qualifies a timestamp’s stat_func value as denoting a change point.

  • window (OffsetStr | tuple[OffsetStr, OffsetStr]) –

    Size of the moving twin windows. This is the number of observations used for calculating the statistic.

    If it is a single frequency offset, it applies for the backward- and the forward-facing window.

    If two offsets (as a tuple) are passed, the first defines the size of the backward-facing window, the second the size of the forward-facing window.

  • min_periods (int>=0 | tuple[int>=0, int>=0]) – Minimum population required in every window. Minimum number of observations in a window required to perform the changepoint test. If it is a tuple of two ints, the first refers to the backward-facing, the second to the forward-facing window.

  • reduce_window (Optional[OffsetStr] (default: None)) –

    Merge adjacent changepoints. The sliding-window search method is not an exact changepoint search method; usually, not a single changepoint is detected, but a “region” of change around a changepoint.

    If reduce_window is given, then for every window of size reduce_window the value with index reduce_func(x, y) is selected and the others are dropped.

    If reduce_window is None, the reduction window size equals the twin window size, the changepoints have been detected with.

  • reduce_func (Callable[[ndarray, ndarray], int] (default: <function ChangepointsMixin.<lambda>>)) – Merge-function for adjacent changepoints. A function that must return an index value upon input of two arrays x and y. First input parameter will hold the result from the stat_func evaluation for every reduction window. Second input parameter holds the result from the thresh_func evaluation. The default reduction function just selects the value that maximizes the stat_func.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
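
A minimal usage sketch, assuming a SaQC object qc with a variable "discharge" (hypothetical name); the mean difference of the twin windows is compared against a fixed threshold (both functions are illustrative choices):

>>> import numpy as np
>>> stat = lambda x, y: np.abs(np.nanmean(x) - np.nanmean(y))
>>> thresh = lambda x, y: 0.5
>>> qc = qc.flagChangePoints("discharge", stat_func=stat, thresh_func=thresh, window="1D", min_periods=10)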

flagConstants(field, thresh, window, min_periods=2, flag=255.0, **kwargs)#

Flag constant data values.

Flags plateaus of constant data if their maximum total change in a rolling window does not exceed a certain threshold.

Any interval of values y(t), …, y(t+n) is flagged if:
  • n > window

  • abs(y(t + i) - y(t + j)) < thresh for all i, j in [0, 1, …, n]

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • thresh (float>=0) – Maximum total change allowed per window.

  • window (OffsetStr | int>=1) – Size of the rolling window. If an integer is passed, it represents the number of timestamps per window. If an offset string is passed, it represents the windows total temporal extent.

  • min_periods (int>=0 (default: 2)) – Minimum number of valid timestamps that are necessary to be present in any window, in order to trigger condition testing for this window. Windows with fewer timestamps are skipped. Must be >= 2, because a single value is always considered constant.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
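
A minimal usage sketch, assuming a SaQC object qc with a variable "wind_speed" (hypothetical name); values are flagged where the total change within a 6-hour window stays below 0.1:

>>> qc = qc.flagConstants("wind_speed", thresh=0.1, window="6h", min_periods=4)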

flagDriftFromNorm(field, window, spread, frac=0.5, metric=<function cityblock>, method='single', flag=255.0, **kwargs)#

Flags deviation from the central moment.

“Normality” is determined in terms of a maximum spreading distance that members of a normal group must not exceed. In addition, a group is only considered “normal” if it contains more than a fraction frac of the variables in field.

See the Notes section for a more detailed presentation of the algorithm.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • window (OffsetStr) –

    Chunk size.

    The data will be broken up into chunks of size window, and the flagging statistics will be calculated separately on each of those chunks.

  • spread (float>=0) –

    Expected spreading maximum.

    Given a set of timeseries representing the same variable, this determines the maximum spreading expected from those. See Notes section for more details.

  • frac (float in [0, 1] (default: 0.5)) –

    Portion threshold for normal group.

    The group identified as exhibiting normal behavior must contain at least a fraction frac of the targeted variables to be valid and to trigger flagging of the variables not included. The higher the value, the more stable the algorithm. For values below 0.5 the results are undefined.

  • metric (Callable[[ndarray | Series, ndarray | Series], ndarray] (default: <function cityblock>)) –

    Distance metric.

    Distance function that takes two arrays as input and returns a scalar float. This value is interpreted as the distance of the two input arrays. Defaults to the averaged manhattan metric (see Notes).

  • method (Literal['single', 'complete', 'average', 'weighted', 'centroid', 'median', 'ward'] (default: 'single')) –

    Linkage method.

    The linkage method used for hierarchical (agglomerative) clustering of the data. method is directly passed to scipy.hierarchy.linkage. See its documentation [1] for more details. For a general introduction on hierarchical clustering see [2].

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

The following steps are performed for every data “segment” of length window in order to find the “abnormal” data:

  1. Calculate distances \(d(x_i, x_j)\) for all variables \(x_i\) in parameter field (with \(d\) denoting the distance function specified by metric).

  2. Calculate a dendrogram with the hierarchical linkage algorithm specified by method.

  3. Flatten the dendrogram at the level where the agglomeration costs exceed spread.

  4. Check if there is a cluster containing more than a fraction frac of the variables:

    1. If yes: flag all the variables that are not in that cluster (inside the segment).

    2. If no: flag nothing.

The main parameter giving control over the algorithm's behavior is the spread parameter, which determines the maximum spread of a normal group by limiting the costs a cluster agglomeration must not exceed in every linkage step. For singleton clusters, those costs simply equal half the distance the data in the clusters have to each other. So no data can be clustered together that are more than 2*spread apart from each other. When data get clustered together, the new cluster's distance to all the other data/clusters is calculated according to the linkage method specified by method. By default, this is the minimum distance the members of the clusters have to each other. With that in mind, it is advisable to choose a distance function that can be well interpreted in the unit dimension of the measurement and whose interpretation is invariant over the length of the data. That is why the “averaged Manhattan metric” is set as the metric default, since it corresponds to the averaged value distance two data sets have (as opposed to, for example, the Euclidean distance).

References

Documentation of the underlying hierarchical clustering algorithm:

[1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html

Introduction to Hierarchical clustering:

[2] https://en.wikipedia.org/wiki/Hierarchical_clustering
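
A minimal usage sketch, assuming a SaQC object qc holding three variables that measure the same quantity (hypothetical names); spread is an illustrative value in the unit of the measurement:

>>> qc = qc.flagDriftFromNorm(["sensor_a", "sensor_b", "sensor_c"], window="1D", spread=3.0, frac=0.5, method="single")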

flagDriftFromReference(field, reference, freq, thresh, metric=<function cityblock>, flag=255.0, **kwargs)#

Flags data that deviates from a reference.

Deviation is measured by a custom distance function.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • freq (FreqStr) – Chunk size.

  • reference (SaQCColumns) –

    Reference variable.

    Deviation is calculated as the distance to the timeseries registered to the SaQC object under reference.

  • thresh (float>=0) –

    Maximum distance from reference.

    Data whose distance to reference exceeds thresh according to metric is flagged.

  • metric (Callable[[ndarray | Series, ndarray | Series], ndarray] (default: <function cityblock>)) –

    Distance function.

    Takes two arrays as input and returns a scalar float. This value is interpreted as the mutual distance of the two input arrays. Defaults to the averaged manhattan metric (see Notes).

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

It is advisable to choose a distance function that can be well interpreted in the unit dimension of the measurement and whose interpretation is invariant over the length of the data. That is why the “averaged Manhattan metric” is set as the metric default, since it corresponds to the averaged value distance two data sets have (as opposed to, for example, the Euclidean distance).
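
A minimal usage sketch, assuming a SaQC object qc holding two test variables and a reference variable (hypothetical names); thresh is an illustrative distance in the unit of the measurement:

>>> qc = qc.flagDriftFromReference(["sensor_a", "sensor_b"], reference="sensor_ref", freq="1D", thresh=5.0)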

flagDummy(field, **kwargs)#

Pass on data and flags.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

flagGeneric(field, func, target=None, flag=255.0, **kwargs)#

Apply custom flagging rule.

Boolean valued function func will be applied to the timeseries represented by field and the result will be interpreted as flags and assigned to the variable target.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • func (GenericFunction) –

    Function that assigns the flags.

    This function is expected to map input data series to a boolean series/array of the same size. If field lists multiple variables, those are mapped one-to-one, in order, to the arguments of func. The number of arguments func implements must match the number of elements in field.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Examples

  1. Flag the variable ‘rainfall’ if the sum of the variables ‘temperature’ and ‘uncertainty’ is below zero:

qc = qc.flagGeneric(field=["temperature", "uncertainty"], target="rainfall", func=lambda x, y: x + y < 0)

  2. Flag the variable ‘temperature’ where the variable ‘fan’ is flagged:

qc = qc.flagGeneric(field="fan", target="temperature", func=lambda x: isflagged(x))

  3. The generic functions also support all pandas and numpy functions:

qc = qc.flagGeneric(field="fan", target="temperature", func=lambda x: np.sqrt(x) < 7)

flagIsolated(field, gap_window, group_window, flag=255.0, **kwargs)#

Flag groups of data that are surrounded by data gaps.

The function flags groups of values that are surrounded by sufficiently large data gaps. A data gap is a timespan containing no valid data. (Data is valid if it is not NaN and if it is not assigned a flag with a level higher than the function's flag value.)

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • gap_window (OffsetStr) –

    Minimum missing data extension that qualifies a gap.

    Minimum gap size required before and after a data group to consider it isolated. See conditions (2) and (3) below.

  • group_window (OffsetStr) –

    Maximum data extension that qualifies an isolated group.

    Maximum size of a data chunk to consider it a candidate for an isolated group. Data chunks larger than this are ignored. This does not include the possible gaps surrounding it. See condition (1) below.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

A series of values \(x_k, x_{k+1}, ..., x_{k+n}\) with timestamps \(t_k, t_{k+1}, ..., t_{k+n}\) is considered isolated if:

  1. \(t_{k+n} - t_k <\) group_window

  2. No valid values exist in the preceding gap of size gap_window.

  3. No valid values exist in the succeeding gap of size gap_window.
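
A minimal usage sketch, assuming a SaQC object qc with a variable "precipitation" (hypothetical name); groups of at most one hour of data surrounded by at least twelve hours of invalid data on both sides are flagged:

>>> qc = qc.flagIsolated("precipitation", gap_window="12h", group_window="1h")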

flagJumps(field, thresh, window, min_periods=0, flag=255.0, dfilter=-inf, **kwargs)#

Flag jumps and drops in data.

Flags values where the mean changes significantly between two adjacent rolling windows, indicating a “jump” from one level to another. Whenever the difference between the means of the two windows exceeds thresh, the values between the windows are flagged.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • thresh (float>=0) –

    Threshold for value change, to qualify as jump/drop.

    The threshold by which the mean of the data in two adjacent windows must differ to trigger flagging.

  • window (OffsetStr) –

    Context size for value level mean.

    Determines the temporal extent used for calculating the mean in each window. Windows should be chosen large enough to obtain a reliable mean, but not too large either, since the window size implies a lower bound for the detection resolution: jumps exceeding thresh but lying less than 3/4 of the window size apart from each other may not be detected reliably.

  • min_periods (int>=0 (default: 0)) –

    Minimum population size required in every window.

    Minimum number of timestamps in a window required to calculate a valid mean. If no valid mean for the window can be calculated, flagging won't be triggered for the associated change point.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

Jumps closer together than three fourths (3/4) of the window size may not be detected reliably.

Examples

The diagram below illustrates the interaction of parameters for a positive value jump initializing a new mean level.

[Figure: flagJumpsPic.png]

The two adjacent windows of size window roll through the data series. Whenever the mean values differ by more than thresh, flagging is triggered.
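
A minimal usage sketch, assuming a SaQC object qc with a variable "water_level" (hypothetical name); a jump is flagged whenever the mean level changes by more than 10 units between two adjacent 12-hour windows:

>>> qc = qc.flagJumps("water_level", thresh=10.0, window="12h", min_periods=5)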

flagLOF(field, n=20, thresh=1.5, algorithm='ball_tree', p=1, flag=255.0, **kwargs)#

Local Outlier Factor.

Flag values where the Local Outlier Factor (LOF) exceeds a given cutoff.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • n (int>0 (default: 20)) –

    Nearest neighbors number.

    Number of nearest neighbors to be included into the LOF calculation. Defaults to 20, which is a value found to be suitable in the literature.

    • n determines the “locality” of an observation (its n nearest neighbors) and sets the upper limit to the number of values in outlier clusters (i.e. consecutive outliers). Outlier clusters of size greater than n/2 may not be detected reliably.

    • The larger n, the lower the algorithm’s sensitivity to local outliers and small or singleton outlier points. Higher values greatly increase numerical costs.

  • thresh (Union[Literal['auto'], float>=1] (default: 1.5)) –

    Cutoff threshold.

    The threshold for flagging the calculated LOF. A LOF of around 1 is considered normal and most likely corresponds to inlier points.

    • The “automatic” thresholding introduced with the publication of the algorithm defaults to 1.5.

    • In this implementation, passing 'auto' flags the scores with a modified 3-sigma rule.

  • algorithm (Literal['ball_tree', 'kd_tree', 'brute', 'auto'] (default: 'ball_tree')) –

    NN-algorithm.

    Algorithm used for calculating the n-nearest neighbors.

  • p (int>0 (default: 1)) –

    Minkowski degree.

    Degree of the metric (“Minkowski”), according to which the distance to neighbors is determined. Most important values are:

    • 1 - Manhattan Metric

    • 2 - Euclidean Metric

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

  • The flagLOF() function calculates the Local Outlier Factor (LOF) for every point in the input timeseries. The LOF is a scalar value that roughly correlates to the reachability, or “outlierishness”, of the evaluated data point. If a point is as reachable as all its n-nearest neighbors, the LOF score evaluates to around 1. If it is only half as reachable as its n-nearest neighbors (so to say, twice as “outlierish”), the score is about 2. So, the Local Outlier Factor relates a point’s reachability to the reachability of its n-nearest neighbors in a multiplicative fashion (as a “factor”).

  • The reachability of a point thereby is determined as an aggregation of the point’s distances to its n-nearest neighbors, measured with regard to the Minkowski metric of degree p (usually Euclidean).

  • To derive a binary label for every point (outlier: yes, or no), the scores are cut off at a level, determined by thresh.
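
A minimal usage sketch, assuming a SaQC object qc with a variable "nitrate" (hypothetical name); the documented defaults for n, algorithm and p are made explicit:

>>> qc = qc.flagLOF("nitrate", n=20, thresh=1.5, algorithm="ball_tree", p=1)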

flagMissing(field, flag=255.0, dfilter=-inf, **kwargs)#

Deprecated since version 2.7.0: Deprecated function. Please use flagNAN() instead.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

flagNAN(field, flag=255.0, dfilter=-inf, **kwargs)#

Flag NaNs in data.

By default, only NaNs that do not already have a flag are flagged. dfilter can be used to pass a flag that is used as threshold: each flag worse than the threshold is replaced by the function. This is because the data gets masked (with NaNs) before the function evaluates the NaNs.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
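
Examples

A minimal usage sketch (the example data is made up for illustration; the flag value refers to the default float scheme):

>>> import numpy as np
>>> import pandas as pd
>>> import saqc
>>> data = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0], name='data', index=pd.date_range('2000', freq='1h', periods=5))
>>> qc = saqc.SaQC(data)
>>> qc = qc.flagNAN('data')  # both NaN periods now carry the BAD flag (255.0)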

flagOffset(field, window, tolerance=None, thresh=None, thresh_relative=None, flag=255.0, **kwargs)#

Flag offsetted groups of values.

This test classifies values or value sequences as outliers by detecting abrupt rises and subsequent returns to the original value level within a given amount of time. Both single-value spikes and plateau-like sequences are detected.

Values \(x_n, x_{n+1}, ..., x_{n+k}\) with timestamps \(t_n, t_{n+1}, ..., t_{n+k}\) are considered offsets if:

  1. \(|x_{n-1} - x_{n+s}| >\) thresh for all \(s \in [0, ..., k]\)

  2. If thresh_relative > 0, \(x_{n+s} > x_{n-1}*(1 + thresh_relative)\)

  3. If thresh_relative < 0, \(x_{n+s} < x_{n-1}*(1 + thresh_relative)\)

  4. \(|x_{n-1} - x_{n+k+1}| <\) tolerance

  5. \(|t_{n-1} - t_{n+k+1}| <\) window

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • window (FreqStr) –

    Offset group size (maximum).

    Maximum temporal length allowed for an offset sequence to trigger flagging (condition 5). Integer-defined windows are only allowed for regularly sampled timestamps.

  • tolerance (Optional[float>=0] (default: None)) –

    Maximum difference between offset footpoints.

    Maximum allowed difference between the value preceding and succeeding an offset sequence to trigger flagging (condition 4).

  • thresh (Optional[float>=0] (default: None)) –

    Offsetting threshold.

    Minimum absolute difference between a value and its successors to consider the successors a possible anomalous offset sequence (condition 1). If None, this condition is ignored.

  • thresh_relative (UnionType[tuple, float, None] (default: None)) –

    Relative offsetting threshold.

    Minimum relative change between a value and its successors for the successors to be considered a possible anomalous offset sequence (conditions 2 and 3). If None, conditions (2) and (3) are not tested. A positive value restricts detection to upward offsets, a negative value to downward offsets. To detect offsets larger than a in both directions, pass the tuple (a, -a); differing positive and negative threshold values are possible as well.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Examples

The picture below gives an abstract interpretation of the parameter interplay in the case of a positive value jump initializing an offset course.

../_images/flagOffsetPic.png

The four values marked red are flagged because (1) the initial value jump exceeds thresh, (2) the temporal extension of the group does not exceed window, and (3) the returning value after the group lies within tolerance distance from the initial one.#

Lets generate a simple, regularly sampled timeseries with an hourly sampling rate and generate an saqc.SaQC instance from it.

>>> import numpy as np
>>> import pandas as pd
>>> import saqc
>>> data = pd.DataFrame({'data':np.array([5,5,8,16,17,7,4,4,4,1,1,4])}, index=pd.date_range('2000',freq='1h', periods=12))
>>> data
                     data
2000-01-01 00:00:00     5
2000-01-01 01:00:00     5
2000-01-01 02:00:00     8
2000-01-01 03:00:00    16
2000-01-01 04:00:00    17
2000-01-01 05:00:00     7
2000-01-01 06:00:00     4
2000-01-01 07:00:00     4
2000-01-01 08:00:00     4
2000-01-01 09:00:00     1
2000-01-01 10:00:00     1
2000-01-01 11:00:00     4
>>> qc = saqc.SaQC(data)

Now we apply flagOffset() to flag offset courses that do not extend longer than 6 hours in time (window), that start with an initial value jump higher than 2 (thresh), and that return to the initial value level within a tolerance of 1.5 (tolerance).

>>> qc = qc.flagOffset(field="data", thresh=2, tolerance=1.5, window="6h")
>>> qc.plot(field="data")  

Note that both negative and positive jumps are considered starting points of offsets. To restrict detection to positive jumps only, use thresh_relative > 0:

>>> qc = qc.flagOffset(field="data", thresh=2, thresh_relative=.9, tolerance=1.5, window='6h')
>>> qc.plot(field="data")  

To detect only negative offsets, use a negative relative threshold:

>>> qc = qc.flagOffset(field="data", thresh=2, thresh_relative=-.5, tolerance=1.5, window="6h")
>>> qc.plot(field="data")  
flagPatternByDTW(field, reference, max_distance=0.0, normalize=True, plot=False, flag=255.0, **kwargs)#

Pattern Recognition with DTW metric.

Identify stretched or squeezed versions of a pattern by evaluating their difference according to a metric constructed with the Dynamic Time Warping (DTW) algorithm.

The steps are:

  1. work on a moving window

  2. for each data chunk extracted from each window, a distance to the given pattern is calculated, by the dynamic time warping algorithm [1]

  3. if the distance is below the threshold, all the data in the window gets flagged

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • reference (SaQCColumns) –

    Variable holding the pattern.

    The pattern must not have NaNs.

  • max_distance (float>=0 (default: 0.0)) –

    DTW-distance limit.

    Maximum dtw-distance between chunk and pattern, for chunk to be interpreted as an instance of the pattern.

  • normalize (bool (default: True)) –

    DTW distance normalisation.

    If False, return unmodified distances. If True, normalize distances by the number of observations in the reference. This makes it easier to find a good cutoff threshold for further processing, since the distances then refer to the mean distance per datapoint, expressed in the data's units.

  • plot (bool (default: False)) –

    Calibration plot.

    Show a calibration plot, which can be quite helpful to find the right threshold for max_distance. It works best with normalize=True. Do not use in automatic setups / pipelines. The plot shows three lines:

    • data: the data the function was called on

    • distances: the calculated distances by the algorithm

    • indicator: has two distinct levels: 0 and the value of max_distance (if max_distance is 0.0, it defaults to 1). Wherever the indicator is not 0, the data will be flagged.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

The size of the moving window is set to equal the temporal extension of the reference data's datetime index.

References

A description of the underlying Dynamic Time Warping algorithm can be found here:

[1] https://cran.r-project.org/web/packages/dtw/dtw.pdf
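
Examples

A rough usage sketch. The synthetic data, the way the reference pattern is added as a second variable of the SaQC object, and the max_distance value are illustrative assumptions only:

>>> import numpy as np
>>> import pandas as pd
>>> import saqc
>>> idx = pd.date_range('2000', freq='10min', periods=100)
>>> pattern = pd.Series(np.sin(np.linspace(0, np.pi, 12)), index=idx[:12], name='pattern')
>>> values = np.random.randn(100) * 0.1
>>> values[40:52] = pattern.values  # embed one occurrence of the pattern into the noise
>>> data = pd.Series(values, index=idx, name='data')
>>> qc = saqc.SaQC([data, pattern])
>>> qc = qc.flagPatternByDTW('data', reference='pattern', max_distance=0.5, normalize=True)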

flagPlateau(field, min_length, max_length=None, min_jump=None, granularity=None, flag=255.0, **kwargs)#

Flag offsetted groups of data (plateaus).

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • min_length (int>0 | OffsetStr) – Size threshold for offsetted groups. Minimum temporal extension of a value course to qualify as a plateau.

  • max_length (UnionType[int>0, OffsetStr, None] (default: None)) – Size limit for offsetted groups. Maximum temporal extension of a value course to qualify as a plateau (upper detection limit). If None, a detection limit based on the data length is used.

  • min_jump (Optional[float>=0] (default: None)) – Offset (jump) threshold. Minimum difference a plateau must have from directly preceding and succeeding periods. If None, the minimum jump threshold is derived automatically from the median of local absolute differences in the vicinity of potential anomalies.

  • granularity (UnionType[int>0, OffsetStr, None] (default: None)) – Search precision. Smaller values increase precision but also computational cost. If None, defaults to 5.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

Minimum plateau length should be set higher than about twice the sampling rate. For detecting shorter anomalies, use flagUniLOF() or flagZScore().

Examples

Detect plateaus longer than 100 minutes:

>>> import pandas as pd
>>> import saqc
>>> data = pd.read_csv('./resources/data/turbidity_plateaus.csv', parse_dates=['data'], index_col=0, nrows=10000)
>>> qc = saqc.SaQC(data)
>>> qc = qc.flagPlateau(field='base3', min_length='100min')
>>> qc.plot('base3')  
../_images/saqc-SaQC-3.png
flagRange(field, min=None, max=None, flag=255.0, **kwargs)#

Flag values that exceed fixed bounds.

The function flags values that are not part of the closed interval [min, max].

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • min (float) – Lower bound.

  • max (float) – Upper bound.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
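
Examples

A minimal sketch (the data and the bounds are illustrative):

>>> import pandas as pd
>>> import saqc
>>> data = pd.Series([12.0, 105.3, -4.2, 37.8], name='data', index=pd.date_range('2000', freq='1h', periods=4))
>>> qc = saqc.SaQC(data)
>>> qc = qc.flagRange('data', min=0, max=100)  # 105.3 and -4.2 lie outside [0, 100] and get flagged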

flagRegimeAnomaly(field, cluster_field, spread, method='single', metric=<function DriftMixin.<lambda>>, frac=0.5, flag=255.0, **kwargs)#

Flag anomalous regimes with respect to the modelling regimes of field.

"Normality" is determined in terms of a maximum spreading distance that regimes must not exceed with respect to a certain metric and linkage method.

In addition, a group of regimes is only considered "normal" if it comprises more than a fraction of frac of the valid samples in field.

Note that the regime changepoints must be detected prior to calling this function.

Note that it is possible to perform hypothesis tests for regime equality by passing a p-value calculating function as metric and selecting the linkage method "complete".

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • cluster_field (SaQCColumns) –

    Cluster labels variable.

    Column in data, holding the cluster labels for the samples in field. (has to be indexed equal to field)

  • spread (float>=0) –

    Agglomeration supremum.

    A threshold denoting the value level up to which clusters are agglomerated.

  • method (Literal['single', 'complete', 'average', 'weighted', 'centroid', 'median', 'ward'] (default: 'single')) –

    Linkage method used.

    The linkage method for hierarchical (agglomerative) clustering of the variables.

  • metric (Callable[[ndarray | Series, ndarray | Series], float] (default: <function DriftMixin.<lambda> at 0x7f22a7744900>)) –

    Metric of regime distances.

    A metric function for calculating the dissimilarity between 2 regimes. Defaults to the absolute difference in mean.

  • frac (float in [0, 1] (default: 0.5)) –

    Minimum variable portion for normal groups.

    The minimum fraction of samples the "normal" group has to comprise in order to actually be considered the normal group. Must be in the closed interval [0, 1], otherwise a ValueError is raised.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
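
Examples

A minimal sketch, assuming a variable 'clusters' already holds regime labels for 'data' (e.g. obtained from a preceding changepoint detection); the data values and the spread value are illustrative:

>>> import pandas as pd
>>> import saqc
>>> idx = pd.date_range('2000', freq='1h', periods=9)
>>> data = pd.DataFrame({'data': [1.0, 1.2, 0.9, 5.0, 5.1, 4.9, 1.1, 0.8, 1.0],
...                      'clusters': [0, 0, 0, 1, 1, 1, 2, 2, 2]}, index=idx)
>>> qc = saqc.SaQC(data)
>>> qc = qc.flagRegimeAnomaly('data', cluster_field='clusters', spread=1.0)  # the middle regime deviates by ~4 in mean and gets flagged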

flagUnflagged(field, flag=255.0, **kwargs)#

Assign flag to all UNFLAGGED periods.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

Function ignores the dfilter keyword.

flagUniLOF(field, n=20, thresh=None, probability=None, corruption=None, algorithm='ball_tree', p=1, density='auto', fill_na=True, slope_correct=True, min_offset=None, flag=255.0, **kwargs)#

Univariate outlier detection, based on LOF.

This function wraps a standard LOF implementation and provides a simplified, parameter-minimal interface for univariate outlier detection. LOF is applied on a combination of the variable values and a temporal density/penalty measure that reflects spacing between data points.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • n (int>0 (default: 20)) –

    Neighborhood extension.

    Number of samples to include in the LOF neighborhood (the n nearest neighbors).

  • thresh (Union[Literal['auto'], float>=0, None] (default: None)) –

    Outlier-factor cutoff.

    Values with LOF scores greater than this threshold are flagged.

  • probability (Optional[float in [0, 1]] (default: None)) –

    Outlier-probability cutoff.

    Values with probabilities greater than this threshold are flagged.

  • corruption (UnionType[float in [0, 1], int>0, None] (default: None)) –

    Maximum portion of anomalous data.

    Either as a fraction in [0, 1] or as an integer specifying the number of anomalous samples.

  • algorithm (Literal['ball_tree', 'kd_tree', 'brute', 'auto'] (default: 'ball_tree')) – Nearest-neighbor assessment algorithm.

  • p (int>0 (default: 1)) –

    Minkowski degree for NN metric.

    Degree of the Minkowski metric used for neighbor distance calculation (e.g., 1 Manhattan, 2 Euclidean).

  • density (Union[Literal['auto'], float>0] (default: 'auto')) –

    Time-axis differential form.

    How to derive temporal density. 'auto' uses the median absolute step size, while passing a float sets a fixed increment.

  • fill_na (bool (default: True)) –

    Impute NaN values.

    If True, fill NaNs by linear interpolation before LOF calculation.

  • slope_correct (bool (default: True)) –

    Correction for high derivative regimes.

    If True, suppress flagging of groups of points that seem to correspond to steep value slopes rather than to actual outliers.

  • min_offset (Optional[float>0] (default: None)) –

    Isolation threshold for outlier clusters.

    Minimum jump in value that needs to be registered before and after an outlier cluster in order for the cluster to be flagged.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

  • The UniLOF score quantifies how outlier-like a point is in the 2D space defined by values and their timestamps.

  • Scores near 1 indicate inliers; larger scores suggest increasing anomaly likelihood.

  • A binary label (outlier vs. inlier) is obtained by applying the cutoff defined by thresh.

Examples

See the outlier detection cookbook for a detailed introduction and tuning guidance.

Example usage with default parameter configuration:

Loading data via the pandas CSV parser, casting the index to a DatetimeIndex, generating a SaQC instance from the data, and plotting the variable representing light scattering at 254 nanometers wavelength.

>>> import pandas as pd
>>> import saqc
>>> data = pd.read_csv('./resources/data/hydro_data.csv')
>>> data = data.set_index('Timestamp')
>>> data.index = pd.DatetimeIndex(data.index)
>>> qc = saqc.SaQC(data)
>>> qc.plot('sac254_raw') 
../_images/saqc-SaQC-5.png

We apply flagUniLOF() with default parameter values. This means that the main calibration parameters n and thresh evaluate to 20 and 1.5, respectively.

>>> import saqc
>>> qc = qc.flagUniLOF('sac254_raw')
>>> qc.plot('sac254_raw') 
../_images/saqc-SaQC-6.png
flagZScore(field, method='standard', window=None, thresh=3, min_residuals=None, min_periods=None, center=True, axis=0, flag=255.0, **kwargs)#

Scattering (ZScoring) based outlier detection.

Uses standard score cutoffs to detect outliers (for example, the "3-sigma rule"). The function supports both standard and modified Z-score definitions. To handle non-stationary data, the calculation can be applied within a rolling window. A minimum residual value may be required to avoid over-flagging in low-variance segments.

Parameters:
  • field (SaQCFields) – List of variable names to process.

  • method (Literal['standard', 'modified'] (default: 'standard')) –

    Which scoring method to use.

    • "standard" — mean as expectation, standard deviation as scaling factor.

    • "modified" — median as expectation, median absolute deviation (MAD) as scaling factor.

  • window (UnionType[FreqStr, int>=0, None] (default: None)) –

    Size of the scoring window.

    Either an integer (number of periods) or an offset string (time span). If None (default), all data share a single window.

  • thresh (float>=0 (default: 3)) –

    Cutoff value.

    Points with absolute Z-scores larger than this threshold are flagged.

  • min_residuals (Optional[float>=0] (default: None)) –

    Distance to mean threshold.

    Minimum absolute distance a value must lie apart from its context window's expectation in order for the Z scoring test to be applied.

  • min_periods (Optional[int>0] (default: None)) –

    Minimum Population per window.

    Minimum number of valid observations required in a value's context window in order for the Z scoring test to be applied.

  • center (bool (default: True)) –

    Center windows around scored values.

    If True (default), the tested value is centered in its context window; otherwise, it is the window’s last value.

  • axis (int in [0, 1] (default: 0)) –

    Axis along which to compute scores.

    • 0 (default) — compute along the time axis only (separate windows for all fields).

    • 1 — compute along time and data axis (windows are 2 dimensional and span over all fields).

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

For any data value \(x\) at timestamp \(t\) the following steps are performed in order to determine the flagging:

  1. Collect a context population \(X\) based on axis and window.

    • If axis=0, \(X\) contains values of the same field \(x\) is obtained from, sampled within window distance around \(t\).

    • If axis=1, \(X\) contains values of all fields, sampled within window distance from \(t\).

    • If axis=1 and window=1, \(X\) contains values of all fields sampled at \(t\).

../_images/ZscorePopulation.png

  2. Compute the score \(Z = \frac{|E(X) - x|}{S(X)}\)

    • If method="standard": \(E(X)=mean(X)\), \(S(X)=std(X)\)

    • If method="modified": \(E(X)=median(X)\), \(S(X)=MAD(X)\)

  3. Flag \(x\), if \(Z >\) thresh and \(|E(X) - x| >\) min_residuals.
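
Examples

A minimal rolling-window sketch (the synthetic data, the window size and the threshold are illustrative):

>>> import numpy as np
>>> import pandas as pd
>>> import saqc
>>> values = np.sin(np.linspace(0, 10, 200))
>>> values[100] = 5  # inject an outlier
>>> data = pd.Series(values, name='data', index=pd.date_range('2000', freq='10min', periods=200))
>>> qc = saqc.SaQC(data)
>>> qc = qc.flagZScore('data', method='modified', window='1D', thresh=3.5)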

forceFlags(field, flag=255.0, **kwargs)#

Assign specific flag to all periods.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

interpolateByRolling(field, window, func='median', center=True, min_periods=0, flag=-inf, **kwargs)#

Impute NaN values with an aggregation of their context.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • window (OffsetStr | int>0) –

    Rolling window size.

    The size of the window the aggregation is computed from. An integer defines the number of periods to be used, while a string is interpreted as an offset (see pandas.rolling for more information). Integer windows may result in skewed aggregations if called on non-harmonized or irregular data.

  • func (Union[Callable[[Series], float], Literal['sum', 'mean', 'median', 'min', 'max', 'last', 'first', 'std', 'var', 'count', 'sem', 'linear', 'time']] (default: 'median')) –

    Aggregation function.

    The function used for aggregation.

  • center (bool (default: True)) –

    Assign aggregation to center value?

    Center the window around the value. Can only be used with integer windows, otherwise it is silently ignored.

  • min_periods (int>=0 (default: 0)) –

    Minimum required population per window.

    Minimum number of valid (not np.nan) values that have to be available in a window for its aggregation to be computed.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
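
Examples

A minimal sketch, filling an isolated gap with a centered rolling median (the window size and min_periods value are illustrative):

>>> import numpy as np
>>> import pandas as pd
>>> import saqc
>>> data = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0], name='data', index=pd.date_range('2000', freq='1h', periods=5))
>>> qc = saqc.SaQC(data)
>>> qc = qc.interpolateByRolling('data', window=3, func='median', center=True, min_periods=2)  # the NaN gets the median of its window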

orGroup(field, group=None, target=None, flag=255.0, **kwargs)#

Combine flags via OR operation.

Flag the variable(s) field at every period at which field is flagged in at least one of the SaQC objects in group.

See Examples section for examples.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • group (Optional[Sequence[SaQC]] (default: None)) –

    OR operands.

    A collection of SaQC objects. Flag checks are performed on all SaQC objects based on the variables specified in field. Whenever any of the monitored variables is flagged, the associated timestamps will receive a flag.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Examples

Flag data, if the values are above a certain threshold (determined by flagRange()) OR if the values are constant for 3 periods (determined by flagConstants())

>>> import pandas as pd
>>> import saqc
>>> dat = pd.Series([1,0,0,0,0,2,3,4,5,5,7,8], name='data', index=pd.date_range('2000', freq='10min', periods=12))
>>> qc = saqc.SaQC(dat)
>>> qc = qc.orGroup('data', group=[qc.flagRange('data', max=5), qc.flagConstants('data', thresh=0, window=3)])
>>> qc.flags['data']
2000-01-01 00:00:00     -inf
2000-01-01 00:10:00    255.0
2000-01-01 00:20:00    255.0
2000-01-01 00:30:00    255.0
2000-01-01 00:40:00    255.0
2000-01-01 00:50:00     -inf
2000-01-01 01:00:00     -inf
2000-01-01 01:10:00     -inf
2000-01-01 01:20:00     -inf
2000-01-01 01:30:00     -inf
2000-01-01 01:40:00    255.0
2000-01-01 01:50:00    255.0
Freq: 10min, dtype: float64
plot(field, path=None, max_gap=None, mode='oneplot', history='valid', xscope=None, yscope=None, store_kwargs=None, ax=None, ax_kwargs=None, marker_kwargs=None, plot_kwargs=None, dfilter=inf, **kwargs)#

Generate plots.

Causes plots to spawn in ‘interactive’ mode. To skip interactive mode and store figures instead, set parameter path.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • path (Optional[PathStr] (default: None)) –

    Path to store plot to.

    If None is passed, interactive mode is entered; plots are shown immediately and the user needs to close them manually before execution continues. If a filepath is passed instead, store mode is entered and the plot is stored under the passed location.

  • max_gap (Optional[OffsetStr] (default: None)) –

    Limit of plotted gaps.

    If None, all data points will be connected, resulting in long linear lines in case of large data gaps. NaN values will be removed before plotting. If an offset string is passed, only points whose distance is below max_gap are connected via the plotting line.

  • mode (Literal['subplots', 'oneplot', 'biplot'] (default: 'oneplot')) –

    Plotting mode.

    How to process multiple variables to be plotted:

    • "oneplot": plot all variables with their flags in one axis (default)

    • "subplots": generate a subplot grid where each axis contains one variable plot with associated flags

    • "biplot": plot the first and second variable in field against each other in a scatter plot (point cloud).

  • history (Union[Literal['valid', 'complete'], list[str], None] (default: 'valid')) –

    Plot flagging history.

    Discriminate the plotted flags with respect to the tests they originate from.

    • "valid": Only plot flags, that are not overwritten by subsequent tests. Only list tests in the legend, that actually contributed flags to the overall result.

    • None: Just plot the resulting flags for one variable, without any historical and/or meta information.

    • list of strings: List of tests. Plot flags from the given tests, only.

    • complete (not recommended, deprecated): Plot all the flags set by any test, independently from them being removed or modified by subsequent modifications. (this means: plotted flags do not necessarily match with flags ultimately assigned to the data)

  • xscope (UnionType[DateStringSlice, OffsetStr, DateIndexStr, None] (default: None)) –

    X axis limits.

    Determine a chunk of the data to be plotted. xscope can be anything that is a valid argument to the pandas.Series.__getitem__ method.

  • yscope (UnionType[list[tuple[float, float]], tuple[float, float], dict, None] (default: None)) –

    Y axis limits.

    Either a tuple of 2 scalars that determines the y-view limits of all plots, or a list of such tuples determining the different variables' y-view limits (must match the number of variables), or a dictionary with variables as keys and y-view tuples as values.

  • ax (Optional[Axes] (default: None)) –

    Custom axes.

    If not None, plot into the given matplotlib.Axes instance, instead of a newly created matplotlib.Figure. This option offers a possibility to integrate SaQC plots into custom figure layouts.

  • store_kwargs (Optional[dict] (default: None)) –

    Save Configuration.

    Keywords to be passed on to the matplotlib.pyplot.savefig method, handling the figure storing. To store a pickle object of the figure, use the option {"pickle": True}, but note that all other store_kwargs are ignored then. To reopen a pickled figure execute: pickle.load(open(savepath, "rb")).show()

  • ax_kwargs (Optional[dict] (default: None)) –

    Axes configuration.

    Axis keywords. Change axis specifics. Those are passed on to the matplotlib.axes.Axes.set method and can have the options listed there. The following options are saqc specific:

    • "xlabel": Either single string, that is to be attached to all x-axis´, or a List of labels, matching the number of variables to plot in length, or a dictionary, directly assigning labels to certain fields - defaults to None (no labels)

    • "ylabel": Either single string, that is to be attached to all y-axis´, or a List of labels, matching the number of variables to plot in length, or a dictionary, directly assigning labels to certain fields - defaults to None (no labels)

    • "title": Either a List of labels, matching the number of variables to plot in length, or a dictionary, directly assigning labels to certain variables - defaults to None (every plot gets titled the plotted variables name)

    • "fontsize": (float) Adjust labeling and titeling fontsize

    • "nrows", "ncols": shape of the subplot matrix the plots go into: If both are assigned, a subplot matrix of shape nrows x ncols is generated. If only one is assigned, the unassigned dimension is 1. defaults to plotting into subplot matrix with 2 columns and the necessary number of rows to fit the number of variables to plot.

  • marker_kwargs (Optional[dict] (default: None)) –

    Marker configuration.

    Keywords to modify flags marker appearance. The markers are set via the matplotlib.pyplot.scatter method and can have the options listed there. The following options are saqc specific:

    • "cycleskip": (int) start the cycle of shapes that are assigned any flag-type with a certain lag - defaults to 0 (no skip)

  • plot_kwargs (Optional[dict] (default: None)) –

    Plot rendering configuration.

    Keywords to modify the plot appearance. The plotting is delegated to matplotlib.pyplot.plot, all options listed there are available. Additionally the following saqc specific configurations are possible:

    • "alpha": Either a scalar float in [0,1], that determines all plots’ transparencies, or a list of floats, matching the number of variables to plot.

    • "linewidth": Either single float in [0,1], that determines the thickness of all plotted, or a list of floats, matching the number of variables to plot.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

  • Check/modify the module parameter saqc.lib.plotting.SCATTER_KWARGS to see/modify global marker defaults

  • Check/modify the module parameter saqc.lib.plotting.PLOT_KWARGS to see/modify global plot line defaults

processGeneric(field, func, target=None, dfilter=-inf, **kwargs)#

Apply custom transformation.

Function func will be applied to the timeseries represented by field and the result will be written to the variable target.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • func (GenericFunction) –

    Function that transforms field-data.

    This function is expected to map an input data series to a series/array of the same size. If field lists multiple variables, these are mapped, in order, onto the arguments of func. The number of arguments func implements must match the number of elements in field.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

All the numpy functions are available within the generic expressions.

Examples

Compute the sum of the variables 'rainfall' and 'snowfall' and save the result to a (new) variable 'precipitation':
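
A sketch of how this can be expressed (the construction of the input data is illustrative):

>>> import pandas as pd
>>> import saqc
>>> data = pd.DataFrame({'rainfall': [1], 'snowfall': [2]}, index=pd.DatetimeIndex(['1970-01-01']))
>>> qc = saqc.SaQC(data)
>>> qc = qc.processGeneric(field=['rainfall', 'snowfall'], target='precipitation', func=lambda x, y: x + y)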

The resulting dataset then holds the new variable:

            rainfall  snowfall  precipitation
1970-01-01         1         2              3

propagateFlags(field, window, method='ffill', flag=255.0, dfilter=-inf, **kwargs)#

Propagate flags along date axis.

Extent and direction of propagation can be controlled through parameters window and method.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • window (OffsetStr | int>0) –

    Extension of the propagation.

    An integer defines the exact number of periods to propagate, while a string is interpreted as a time offset.

  • method (Literal['ffill', 'bfill'] (default: 'ffill')) –

    Direction of the propagation.

    • ffill — propagate flag to subsequent values

    • bfill — propagate flag to preceding values

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Examples

First, generate some data and some flags:

>>> import numpy as np
>>> import pandas as pd
>>> import saqc
>>> data = pd.DataFrame({"a": [-3, -2, -1, 0, 1, 2, 3]})
>>> flags = pd.DataFrame({"a": [-np.inf, -np.inf, -np.inf, 255.0, -np.inf, -np.inf, -np.inf]})
>>> qc = saqc.SaQC(data=data, flags=flags)
>>> qc.flags["a"]
0     -inf
1     -inf
2     -inf
3    255.0
4     -inf
5     -inf
6     -inf
dtype: float64

Now, to repeat the flag ‘255.0’ two times in the direction of ascending indices, execute:

>>> qc.propagateFlags('a', window=2, method="ffill").flags["a"]
0     -inf
1     -inf
2     -inf
3    255.0
4    255.0
5    255.0
6     -inf
dtype: float64

Choosing “bfill” will result in:

>>> qc.propagateFlags('a', window=2, method="bfill").flags["a"]
0     -inf
1    255.0
2    255.0
3    255.0
4     -inf
5     -inf
6     -inf
dtype: float64

If an explicit flag is passed, it will be used to fill the repetition window:

>>> qc.propagateFlags('a', window=2, method="bfill", flag=111).flags["a"]
0     -inf
1    111.0
2    111.0
3    255.0
4     -inf
5     -inf
6     -inf
dtype: float64
reindex(field, index, method='match', tolerance=None, data_aggregation=None, flags_aggregation=None, broadcast=True, squeeze=False, override=False, **kwargs)#

Resample data at new index.

Simultaneously changes the indices of the data, flags and the history assigned to field.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • index (FreqStr | DatetimeIndex | SaQCColumns) –

    New index.

    • If an offset string: the new index will range from start to end of the original index of field, exhibiting a uniform sampling rate given by the offset string

    • If a str that matches a field present in the SaQC object: that field's index will be used as the new index of field

    • If a pd.DatetimeIndex object is passed, it will become the new index of field.

  • method (Literal['fagg', 'bagg', 'nagg', 'froll', 'broll', 'nroll', 'fshift', 'bshift', 'nshift', 'match', 'sshift', 'mshift', 'invert'] (default: 'match')) –

    Reindexing method.

    Determines which periods of the original index are comprised in the calculation of the new flag and the new data value at any period of the new index.

    • Aggregation reindexers. Aggregations are data- and flags-independent, (pure) index selection methods:

    • 'bagg'/'fagg': "backwards/forwards aggregation". Any new index period gets assigned an aggregation of the values at those periods in the original index that lie between itself and its successor/predecessor.

    • 'nagg': "nearest aggregation". Any new index period gets assigned an aggregation of the values at those periods in the original index, between its direct predecessor and successor, that it is the nearest neighbor to.

    • Rolling reindexers. Rolling reindexers are equal to aggregations when projecting between regular and irregular sampling grids forth and back. Due to their simple rolling window construction, they are easier to comprehend, predict and parametrize. On the downside, they are computationally much more expensive, and periods can get included in the aggregation of multiple target periods (when rolling windows overlap).

    • 'broll'/'froll': Any new index period gets assigned an aggregation of all the values at periods of the original index that fall into a directly preceding/succeeding window of size reindex_window.

    • Shifts. Shifting methods are shortcuts for aggregation reindex methods combined with selecting 'last' or 'first' as the data_aggregation method. Therefore, both the flags_aggregation and the data_aggregation are ignored when using a shift reindexer. Also, periods where the data evaluates to NaN are dropped before shift index selection.

    • 'bshift'/'fshift': "backwards/forwards shift". Any new index period gets assigned the first/last valid (not a data NaN) value it succeeds/precedes.

    • 'nshift': "nearest shift". Any new index period gets assigned the value of its closest neighbor in the original index.

    • Pillar point mappings. Index selection methods designed to select indices suitable for linearly interpolating values from surrounding pillar points in the original index, or for inverting such a selection. Periods where the data evaluates to NaN are dropped from consideration.

    • 'mshift': "Merge" predecessors and successors. Any new index period gets assigned an aggregation/interpolation comprising the last and the next valid period in the original index.

    • 'sshift': "Split"-map values onto predecessors and successors. Same as 'mshift', but with a correction that prevents missing-value flags from being mapped to continuous data chunk bounds.

    • Inversion of last method:

    • 'invert': try to select the method that inverts the last reindexing applied to field (compare the 'nshift' round-trip example below).

  • tolerance (UnionType[OffsetStr, OffsetLike, None] (default: None)) –

    Reindexing scope.

    Limits the distance over which values can be shifted or comprised into an aggregation.

  • data_aggregation (Union[Literal['sum', 'mean', 'median', 'min', 'max', 'last', 'first', 'std', 'var', 'count', 'sem', 'linear', 'time'], Callable, float, None] (default: None)) –

    Data aggregation function.

    Function string or custom function, determining how to aggregate new data values from the values at the periods selected according to the reindexing method. If a scalar value is passed, the new data series will simply evaluate to that scalar at any new index period.

  • flags_aggregation (Union[Literal['sum', 'mean', 'median', 'min', 'max', 'last', 'first', 'std', 'var', 'count', 'sem', 'linear', 'time'], Callable, float, None] (default: None)) –

    Flags Aggregation function.

    Function string or custom function, determining how to aggregate new flags values from the values at the periods selected according to the reindexing method. If a scalar value is passed, the new flags series will simply evaluate to that scalar at any new index period.

  • broadcast (bool (default: True)) –

    Broadcast to reindexing scope.

    Whether to propagate the aggregation result to the full reindex window when using an aggregation reindexer (as opposed to assigning it only to the next/previous/closest period).

  • target (SaQCFields | newSaQCFields) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

../_images/reindexMethods.png

Examples

Generate some example data with an irregular, roughly daily sampling rate:

>>> import pandas as pd
>>> import saqc
>>> import numpy as np
>>> from saqc.constants import FILTER_NONE
>>> np.random.seed(23)
>>> index = pd.DatetimeIndex(pd.date_range('2000', freq='1d', periods=23))
>>> index += pd.Index([pd.Timedelta(f'{k}min') for k in np.random.randint(-360,360,23)])
>>> drops = np.random.randint(0,20,3)
>>> drops.sort()
>>> index=index[np.r_[0:drops[0],drops[0]+1:drops[1],drops[1]+1:drops[2],drops[2]+1:23]]
>>> data = pd.Series(np.abs(np.arange(-10,10)), index=index, name='data')
>>> data 
2000-01-01 03:55:00    10
2000-01-03 02:08:00     9
2000-01-03 18:31:00     8
2000-01-04 21:57:00     7
2000-01-06 01:40:00     6
2000-01-06 23:47:00     5
2000-01-09 04:02:00     4
2000-01-10 05:05:00     3
2000-01-10 18:06:00     2
2000-01-12 01:09:00     1
2000-01-13 02:44:00     0
2000-01-13 18:49:00     1
2000-01-15 05:46:00     2
2000-01-16 01:39:00     3
2000-01-17 05:49:00     4
2000-01-17 21:12:00     5
2000-01-18 18:12:00     6
2000-01-21 03:20:00     7
2000-01-21 22:57:00     8
2000-01-23 03:51:00     9
Name: data, dtype: int64

Performing linear alignment to 2 days grid, with and without limiting the reindexing range:

>>> qc = saqc.SaQC(data)
>>> qc = qc.reindex('data', target='linear', index='2D', method='mshift', data_aggregation='linear')
>>> qc = qc.reindex('data', target='limited_linear', index='2D', method='mshift', data_aggregation='linear', tolerance='1D')
>>> qc.data 
                   data |               linear |       limited_linear |
======================= | ==================== | ==================== |
2000-01-01 03:55:00  10 | 1999-12-31       NaN | 1999-12-31       NaN |
2000-01-03 02:08:00   9 | 2000-01-02  9.565453 | 2000-01-02       NaN |
2000-01-03 18:31:00   8 | 2000-01-04  7.800122 | 2000-01-04  7.800122 |
2000-01-04 21:57:00   7 | 2000-01-06  6.060132 | 2000-01-06       NaN |
2000-01-06 01:40:00   6 | 2000-01-08  4.536523 | 2000-01-08       NaN |
2000-01-06 23:47:00   5 | 2000-01-10  3.202927 | 2000-01-10  3.202927 |
2000-01-09 04:02:00   4 | 2000-01-12  1.037037 | 2000-01-12       NaN |
2000-01-10 05:05:00   3 | 2000-01-14  1.148307 | 2000-01-14       NaN |
2000-01-10 18:06:00   2 | 2000-01-16  2.917016 | 2000-01-16  2.917016 |
2000-01-12 01:09:00   1 | 2000-01-18  5.133333 | 2000-01-18  5.133333 |
2000-01-13 02:44:00   0 | 2000-01-20  6.521587 | 2000-01-20       NaN |
2000-01-13 18:49:00   1 | 2000-01-22  8.036332 | 2000-01-22       NaN |
2000-01-15 05:46:00   2 | 2000-01-24       NaN | 2000-01-24       NaN |
2000-01-16 01:39:00   3 |                      |                      |
2000-01-17 05:49:00   4 |                      |                      |
2000-01-17 21:12:00   5 |                      |                      |
2000-01-18 18:12:00   6 |                      |                      |
2000-01-21 03:20:00   7 |                      |                      |
2000-01-21 22:57:00   8 |                      |                      |
2000-01-23 03:51:00   9 |                      |                      |

Setting a flag, then reindexing the linearly aligned field back onto the original index ("deharmonisation"):

>>> qc = qc.setFlags('linear', data=['2000-01-16'])
>>> qc = qc.reindex('linear', index='data', tolerance='2D', method='sshift', dfilter=FILTER_NONE)
>>> qc.flags[['data', 'linear']] 
                    data |                     linear |
======================== | ========================== |
2000-01-01 03:55:00 -inf | 2000-01-01 03:55:00   -inf |
2000-01-03 02:08:00 -inf | 2000-01-03 02:08:00   -inf |
2000-01-03 18:31:00 -inf | 2000-01-03 18:31:00   -inf |
2000-01-04 21:57:00 -inf | 2000-01-04 21:57:00   -inf |
2000-01-06 01:40:00 -inf | 2000-01-06 01:40:00   -inf |
2000-01-06 23:47:00 -inf | 2000-01-06 23:47:00   -inf |
2000-01-09 04:02:00 -inf | 2000-01-09 04:02:00   -inf |
2000-01-10 05:05:00 -inf | 2000-01-10 05:05:00   -inf |
2000-01-10 18:06:00 -inf | 2000-01-10 18:06:00   -inf |
2000-01-12 01:09:00 -inf | 2000-01-12 01:09:00   -inf |
2000-01-13 02:44:00 -inf | 2000-01-13 02:44:00   -inf |
2000-01-13 18:49:00 -inf | 2000-01-13 18:49:00   -inf |
2000-01-15 05:46:00 -inf | 2000-01-15 05:46:00  255.0 |
2000-01-16 01:39:00 -inf | 2000-01-16 01:39:00  255.0 |
2000-01-17 05:49:00 -inf | 2000-01-17 05:49:00   -inf |
2000-01-17 21:12:00 -inf | 2000-01-17 21:12:00   -inf |
2000-01-18 18:12:00 -inf | 2000-01-18 18:12:00   -inf |
2000-01-21 03:20:00 -inf | 2000-01-21 03:20:00   -inf |
2000-01-21 22:57:00 -inf | 2000-01-21 22:57:00   -inf |
2000-01-23 03:51:00 -inf | 2000-01-23 03:51:00   -inf |

Now, the linear flags can easily be appended to data to complete the "deharmonisation" step.

Another example: shifting to the nearest regular frequency and back. Note how 'nearest'-style reindexers "invert" themselves.

>>> qc = saqc.SaQC(data)
>>> qc = qc.reindex('data', index='1D', target='n_shifted', method='nshift')
>>> qc = qc.reindex('n_shifted', index='data', target='n_shifted_undone', method='nshift')
>>> qc.data 
                   data |        n_shifted |          n_shifted_undone |
======================= | ================ | ========================= |
2000-01-01 03:55:00  10 | 2000-01-01  10.0 | 2000-01-01 03:55:00  10.0 |
2000-01-03 02:08:00   9 | 2000-01-02   NaN | 2000-01-03 02:08:00   9.0 |
2000-01-03 18:31:00   8 | 2000-01-03   9.0 | 2000-01-03 18:31:00   8.0 |
2000-01-04 21:57:00   7 | 2000-01-04   8.0 | 2000-01-04 21:57:00   7.0 |
2000-01-06 01:40:00   6 | 2000-01-05   7.0 | 2000-01-06 01:40:00   6.0 |
2000-01-06 23:47:00   5 | 2000-01-06   6.0 | 2000-01-06 23:47:00   5.0 |
2000-01-09 04:02:00   4 | 2000-01-07   5.0 | 2000-01-09 04:02:00   4.0 |
2000-01-10 05:05:00   3 | 2000-01-08   NaN | 2000-01-10 05:05:00   3.0 |
2000-01-10 18:06:00   2 | 2000-01-09   4.0 | 2000-01-10 18:06:00   2.0 |
2000-01-12 01:09:00   1 | 2000-01-10   3.0 | 2000-01-12 01:09:00   1.0 |
2000-01-13 02:44:00   0 | 2000-01-11   2.0 | 2000-01-13 02:44:00   0.0 |
2000-01-13 18:49:00   1 | 2000-01-12   1.0 | 2000-01-13 18:49:00   1.0 |
2000-01-15 05:46:00   2 | 2000-01-13   0.0 | 2000-01-15 05:46:00   2.0 |
2000-01-16 01:39:00   3 | 2000-01-14   1.0 | 2000-01-16 01:39:00   3.0 |
2000-01-17 05:49:00   4 | 2000-01-15   2.0 | 2000-01-17 05:49:00   4.0 |
2000-01-17 21:12:00   5 | 2000-01-16   3.0 | 2000-01-17 21:12:00   5.0 |
2000-01-18 18:12:00   6 | 2000-01-17   4.0 | 2000-01-18 18:12:00   6.0 |
2000-01-21 03:20:00   7 | 2000-01-18   5.0 | 2000-01-21 03:20:00   7.0 |
2000-01-21 22:57:00   8 | 2000-01-19   6.0 | 2000-01-21 22:57:00   8.0 |
2000-01-23 03:51:00   9 | 2000-01-20   NaN | 2000-01-23 03:51:00   9.0 |
                        | 2000-01-21   7.0 |                           |
                        | 2000-01-22   8.0 |                           |
                        | 2000-01-23   9.0 |                           |
                        | 2000-01-24   NaN |                           |

Furthermore, forward/backward-style reindexers can be inverted by backward/forward-style reindexers:

>>> qc = saqc.SaQC(data)
>>> qc = qc.reindex('data', target='sum_aggregate', index='3D', method='fagg', data_aggregation='sum')
>>> qc = qc.setFlags('sum_aggregate', data=['2000-01-18', '2000-01-24'])
>>> qc = qc.reindex('sum_aggregate', target='bagg', index='data', method='bagg', dfilter=FILTER_NONE)
>>> qc = qc.reindex('sum_aggregate', target='bagg_limited', index='data', method='bagg', tolerance='2D', dfilter=FILTER_NONE)
>>> qc.flags 
                    data |     sum_aggregate |                       bagg |               bagg_limited |
======================== | ================= | ========================== | ========================== |
2000-01-01 03:55:00 -inf | 1999-12-31   -inf | 2000-01-01 03:55:00   -inf | 2000-01-01 03:55:00   -inf |
2000-01-03 02:08:00 -inf | 2000-01-03   -inf | 2000-01-03 02:08:00   -inf | 2000-01-03 02:08:00   -inf |
2000-01-03 18:31:00 -inf | 2000-01-06   -inf | 2000-01-03 18:31:00   -inf | 2000-01-03 18:31:00   -inf |
2000-01-04 21:57:00 -inf | 2000-01-09   -inf | 2000-01-04 21:57:00   -inf | 2000-01-04 21:57:00   -inf |
2000-01-06 01:40:00 -inf | 2000-01-12   -inf | 2000-01-06 01:40:00   -inf | 2000-01-06 01:40:00   -inf |
2000-01-06 23:47:00 -inf | 2000-01-15   -inf | 2000-01-06 23:47:00   -inf | 2000-01-06 23:47:00   -inf |
2000-01-09 04:02:00 -inf | 2000-01-18  255.0 | 2000-01-09 04:02:00   -inf | 2000-01-09 04:02:00   -inf |
2000-01-10 05:05:00 -inf | 2000-01-21   -inf | 2000-01-10 05:05:00   -inf | 2000-01-10 05:05:00   -inf |
2000-01-10 18:06:00 -inf | 2000-01-24  255.0 | 2000-01-10 18:06:00   -inf | 2000-01-10 18:06:00   -inf |
2000-01-12 01:09:00 -inf |                   | 2000-01-12 01:09:00   -inf | 2000-01-12 01:09:00   -inf |
2000-01-13 02:44:00 -inf |                   | 2000-01-13 02:44:00   -inf | 2000-01-13 02:44:00   -inf |
2000-01-13 18:49:00 -inf |                   | 2000-01-13 18:49:00   -inf | 2000-01-13 18:49:00   -inf |
2000-01-15 05:46:00 -inf |                   | 2000-01-15 05:46:00  255.0 | 2000-01-15 05:46:00   -inf |
2000-01-16 01:39:00 -inf |                   | 2000-01-16 01:39:00  255.0 | 2000-01-16 01:39:00  255.0 |
2000-01-17 05:49:00 -inf |                   | 2000-01-17 05:49:00  255.0 | 2000-01-17 05:49:00  255.0 |
2000-01-17 21:12:00 -inf |                   | 2000-01-17 21:12:00  255.0 | 2000-01-17 21:12:00  255.0 |
2000-01-18 18:12:00 -inf |                   | 2000-01-18 18:12:00   -inf | 2000-01-18 18:12:00   -inf |
2000-01-21 03:20:00 -inf |                   | 2000-01-21 03:20:00  255.0 | 2000-01-21 03:20:00   -inf |
2000-01-21 22:57:00 -inf |                   | 2000-01-21 22:57:00  255.0 | 2000-01-21 22:57:00   -inf |
2000-01-23 03:51:00 -inf |                   | 2000-01-23 03:51:00  255.0 | 2000-01-23 03:51:00  255.0 |
renameField(field, new_name, **kwargs)#

Rename field.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • new_name (str) – New name for the field.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
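
Examples

A minimal usage sketch, assuming a small synthetic frame with a column 'raw_temp' (names and values are illustrative only):

>>> import pandas as pd
>>> import saqc
>>> qc = saqc.SaQC(pd.DataFrame({'raw_temp': [1.0, 2.0, 3.0]}))
>>> qc = qc.renameField('raw_temp', new_name='temperature')

Afterwards, the variable is addressed as 'temperature' in subsequent calls.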

resample(field, freq, func='mean', method='bagg', maxna=None, maxna_group=None, squeeze=False, **kwargs)#

Sample data at uniform sampling rate.

The data is resampled to equidistant timestamps: the values in each sampling interval are aggregated with the function given by func, and the result is projected onto the new timestamps according to method. The following methods are available:

  • 'nagg': all values in the range (+/- freq/2) of a grid point get aggregated with func and assigned to it.

  • 'bagg': all values in a sampling interval get aggregated with func and the result gets assigned to the last grid point.

  • 'fagg': all values in a sampling interval get aggregated with func and the result gets assigned to the next grid point.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • freq (FreqStr | Timedelta) – New sampling rate.

  • func (Union[Callable[[Series], Series], str] (default: 'mean')) –

    Aggregation function.

    See notes for performance considerations.

  • method (Literal['fagg', 'bagg', 'nagg'] (default: 'bagg')) –

    Resampling intervals.

    Specifies which interval is aggregated onto a given timestamp (the preceding, succeeding, or "surrounding" interval). See the description above for details.

  • maxna (Optional[int>=0] (default: None)) –

    Limit for number of NaN values.

    Maximum number of NaN values allowed in a resampling interval. If exceeded, the aggregation of that interval evaluates to NaN.

  • maxna_group (Optional[int>=0] (default: None)) –

    Limit for the number of consecutive NaN values.

    Same as maxna but for consecutive NaNs.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

For performance reasons, func will be mapped to pandas.resample methods where possible. For this to work, custom functions need an initialized __name__ attribute holding the function's name. Furthermore, you should not pass numpy's nan-functions (nansum, nanmean, ...), because they cannot be optimised and the handling of NaN is already taken care of.
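
Examples

A minimal sketch, assuming a synthetic 10-minute series named 'data' (names, values, and frequencies are illustrative only):

>>> import numpy as np
>>> import pandas as pd
>>> import saqc
>>> idx = pd.date_range('2000-01-01', freq='10min', periods=12)
>>> qc = saqc.SaQC(pd.Series(np.arange(12, dtype=float), index=idx, name='data'))
>>> qc = qc.resample('data', freq='30min', func='mean', method='nagg', target='data_30min_mean')  # aggregate values +/- 15min around each grid point
>>> qc = qc.resample('data', freq='30min', func='max', method='bagg', target='data_30min_max')     # aggregate each preceding interval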

rolling(field, window, target=None, func='mean', min_periods=0, center=True, **kwargs)#

Rolling window function application.

Evaluate a function at all shifts of a fixed-size window (“rolling window application”).

The resulting values are assigned the worst flag present in the window they were aggregated from. Multiple fields can be selected in order to apply a rolling function to arrays obtained by concatenating the field-specific windows.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • window (OffsetStr | int>0) –

    Rolling window size.

    Size of the rolling window. If an integer, it determines the window size as the number of periods it has to contain at every shift. If an offset string, it determines the window size as its constant temporal extension. For regularly sampled data, the period number is rounded down to an odd number in case center is True.

  • func (Union[Callable[[Series], ndarray], Literal['sum', 'mean', 'median', 'min', 'max', 'std', 'var', 'skew', 'kurt', 'count']] (default: 'mean')) –

    Aggregation function.

    Function to apply to window at each shift. Can either be a custom callable, expecting a pandas.Series object as its input, or a literal from the following list:

    • ”sum” : Sum of values in the window

    • ”mean” : Average of values

    • ”median” : Median

    • ”min” : Minimum

    • ”max” : Maximum

    • ”std” : Standard deviation

    • ”var” : Variance

    • ”skew” : Skewness

    • ”kurt” : Kurtosis

    • ”count” : Number of non-NA observations in the window

  • min_periods (int>=0 (default: 0)) – Minimum population in rolling window. Minimum number of valid observations in the window required to calculate a value.

  • center (bool (default: True)) – Assign function result to window center. If True, function results are assigned to the timestamp at the center of the windows; if False, they are assigned to the highest timestamp in the windows.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Notes

[Figure: horizontalAxisRollingExample.png — example of rolling over multiple variables]
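
Examples

A minimal sketch, assuming an hourly synthetic series named 'data' (names, values, and window sizes are illustrative only):

>>> import numpy as np
>>> import pandas as pd
>>> import saqc
>>> idx = pd.date_range('2000-01-01', freq='1h', periods=48)
>>> qc = saqc.SaQC(pd.Series(np.random.default_rng(0).normal(size=48), index=idx, name='data'))
>>> qc = qc.rolling('data', window='6h', func='median', target='data_median')        # centered 6-hour rolling median
>>> qc = qc.rolling('data', window=5, func='std', center=False, target='data_std')   # 5-period window, assigned to the window end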

selectTime(field, mode, selection_field=None, start=None, end=None, closed=True, **kwargs)#

Apply a mask.

Due to some inner saqc mechanics, it is not straightforwardly possible to exclude values or data chunks from flagging routines. This function replaces flags with the UNFLAGGED value wherever values are to be masked. Furthermore, the masked values are replaced by np.nan, so that they don't affect calculations.

Here is a recipe for applying a flagging function only to a masked chunk of the variable field (a minimal sketch of the recipe follows below):

  1. duplicate field in the input data (copyField)

  2. mask the duplicated data (this function, selectTime)

  3. apply the tests that should only run on the masked data chunks (any saqc flagging function)

  4. project the flags calculated on the duplicated, masked data back onto the original field data (concatFlags or flagGeneric)

  5. drop the duplicated data (dropField)

For an implemented example, check out flagSeasonalRange in the saqc.functions module.
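
A minimal sketch of this recipe (flagRange stands in for whichever flagging function you want to restrict; all names, thresholds, and the chosen period are illustrative only):

>>> import numpy as np
>>> import pandas as pd
>>> import saqc
>>> idx = pd.date_range('2000-01-01', freq='1min', periods=240)
>>> qc = saqc.SaQC(pd.Series(np.random.default_rng(1).normal(size=240), index=idx, name='data'))
>>> qc = qc.copyField('data', target='data_masked')                                   # 1. duplicate
>>> qc = qc.selectTime('data_masked', mode='periodic', start='01:00', end='04:00')    # 2. mask
>>> qc = qc.flagRange('data_masked', max=1.5)                                         # 3. test on the masked copy
>>> qc = qc.concatFlags('data_masked', target='data')                                 # 4. project flags back
>>> qc = qc.dropField('data_masked')                                                  # 5. clean up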

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • mode (Literal['periodic', 'selection_field']) –

    The masking mode.

    • "periodic": the start and end parameters are evaluated to generate a periodic mask

    • "selection_field": data[selection_field] is expected to be a boolean-valued time series and is used as the mask.

  • selection_field (Optional[SaQCColumns] (default: None)) –

    Variable holding the mask.

    Only effective if mode == "selection_field". Field name of the column holding the data to be used as the mask (must be a boolean series). Neither the series' length nor its labels have to match data[field]'s index and length. An inner join of the indices is calculated, and values get masked where the joined values are True.

  • start (Optional[TimestampStr] (default: None)) –

    Season start.

    Only effective if mode == "periodic". String denoting the starting point of every period. Formally, it has to be a truncated instance of "mm-ddTHH:MM:SS" and has to be of the same length as the end parameter. See the examples section below.

  • end (Optional[TimestampStr] (default: None)) –

    Season end.

    Only effective if mode == "periodic". String denoting the end point of every period. Formally, it has to be a truncated instance of "mm-ddTHH:MM:SS" and has to be of the same length as the start parameter. See the examples section below.

  • closed (bool (default: True)) –

    Closure at mask alternation.

    Whether or not to include the bounds defining the mask in the mask itself.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Examples

The start and end parameters provide a convenient way to generate seasonal / date-periodic masks. They have to be strings of one of the following forms:

  • “mm-ddTHH:MM:SS”

  • “ddTHH:MM:SS”

  • “HH:MM:SS”

  • “MM:SS” or “SS”

(mm = month, dd = day, HH = hour, MM = minute, SS = second.) Single-digit specifications have to be given with leading zeros. The start and end strings have to be of the same length (i.e. refer to the same periodicity). The highest date unit gives the period. For example:

>>> start = "01T15:00:00"
>>> end = "13T17:30:00"

will mask all values sampled between 15:00 on the 1st and 17:30 on the 13th of every month.

>>> start = "01:00"
>>> end = "04:00"

All the values between the first and 4th minute of every hour get masked.

>>> start = "01-01T00:00:00"
>>> end = "01-03T00:00:00"

This masks January and February of every year. Masking is always inclusive, so in this case the mask will also include 00:00:00 on the first of March. To exclude that timestamp, pass:

>>> start = "01-01T00:00:00"
>>> end = "02-28T23:59:59"

To mask intervals that wrap around the period boundary, like nights or winter, swap season start and season end. For example, to mask the night hours between 22:00:00 in the evening and 06:00:00 in the morning, pass:

>>> start = "22:00:00"
>>> end = "06:00:00"
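
Wrapped into an actual call, this night-masking example reads (a minimal sketch, assuming an hourly synthetic series named 'data'; names and values are illustrative only):

>>> import numpy as np
>>> import pandas as pd
>>> import saqc
>>> idx = pd.date_range('2000-01-01', freq='1h', periods=72)
>>> qc = saqc.SaQC(pd.Series(np.random.default_rng(2).normal(size=72), index=idx, name='data'))
>>> qc = qc.selectTime('data', mode='periodic', start='22:00:00', end='06:00:00')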

setFlags(field, data, override=True, flag=255.0, **kwargs)#

Assign scheduled flags.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • data (SaQCColumns | list | ArrayLike | Series) –

    Flags schedule.

    Determines at which timestamps to set flags, depending on the passed type (see the Examples section below for a sketch):

    • 1-d array, list of timestamps, or pandas.Index: flag field with flag at every timestamp in data

    • 2-d array or list of tuples: for every element t[k] of data, flag field with flag at every timestamp between t[k][0] and t[k][1]

    • pd.Series: flag field with flag between each index entry and the corresponding value of the passed series

    • str: interpret data as a variable name and use that variable's time series as the flagging template

  • override (bool (default: True)) –

    Override existing flags.

    Determines if flags shall be assigned although the value in question already has a flag assigned.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
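
Examples

A minimal sketch showing two common forms of the data argument, single timestamps and interval tuples (the series and dates are illustrative only):

>>> import pandas as pd
>>> import saqc
>>> idx = pd.date_range('2000-01-01', freq='1D', periods=6)
>>> qc = saqc.SaQC(pd.Series(range(6), index=idx, name='data'))
>>> qc = qc.setFlags('data', data=['2000-01-02'])                      # flag a single timestamp
>>> qc = qc.setFlags('data', data=[('2000-01-04', '2000-01-06')])      # flag everything between the two timestamps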

supervise(field, problem_labels, override=True, target=None, **kwargs)#

Supervise data, so that saqc parameter estimation can be run against it.

The function annotates specific flags (columns) as ground truth (true positives) for subsequent calibration of a flagging function pipeline.

Supervise drops all history columns/flags that are not listed as problem labels.

Parameters:
  • field (str) – Name of the input variable to process.

  • problem_labels (list[Optional[str]]) –

    A list of anomaly types the data is to be classified by.

    Any label that does not appear in the field history will cause a GUI to pop up, where targets for this anomaly type can be assigned. Passing a None label triggers a semi-supervised fit (fitting without flags). Join labels to a single target by grouping them as list items.

  • override (bool (default: True)) – If True (default) and target already exists, it is overridden with the variable and its supervised history. If False and target already exists, the supervised history is appended to it.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

transferFlags(field, target, squeeze=False, overwrite=False, **kwargs)#

Transfer flags between variables.

Flags present at timestamps in the source field(s) are also assigned to the same timestamps in the target field(s).

Optionally, flags already assigned to target are overridden, or squashed together with the new assignment into a single flags column.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • squeeze (bool (default: False)) –

    Append history or aggregation.

    If True, flagging history of field is compressed and function-specific flag information is lost, before it gets appended to target. If False, flagging history of field is appended to target.

  • overwrite (bool (default: False)) –

    Override target flags.

    If True, existing flags in the target field are overridden.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC

Examples

First, generate some data with flags:

>>> import pandas as pd
>>> import saqc
>>> data = pd.DataFrame({'a': [1, 2], 'b': [1, 2], 'c': [1, 2]})
>>> qc = saqc.SaQC(data)
>>> qc = qc.flagRange('a', max=1.5)
>>> qc.flags.to_pandas()
       a    b    c
0   -inf -inf -inf
1  255.0 -inf -inf

Project the flag from a to b:

>>> qc = qc.transferFlags('a', target='b')
>>> qc.flags.to_pandas()
       a      b    c
0   -inf   -inf -inf
1  255.0  255.0 -inf

Project flags of a to both b and c:

>>> qc = qc.transferFlags(['a','a'], ['b', 'c'], overwrite=True)
>>> qc.flags.to_pandas()
       a      b      c
0   -inf   -inf   -inf
1  255.0  255.0  255.0
transform(field, func, freq=None, **kwargs)#

Data transformation.

Transform data by applying a custom function on data chunks of variable size. Existing flags are preserved.

Parameters:
  • field (SaQCFields) – Name of the input variable to process.

  • func (Union[Callable[[Series | ndarray], Series], Literal['sum', 'mean', 'median', 'min', 'max', 'last', 'first', 'std', 'var', 'count', 'sem', 'linear', 'time']]) – Transformation function.

  • freq (UnionType[int>0, FreqStr, None] (default: None)) –

    Segmentation size.

    The transformation is applied to each segment individually:

    • None: Apply the transformation to the entire data set at once

    • int: Apply the transformation to successive data chunks of the given length. Must be greater than 0.

    • Offset string: Apply the transformation to successive data chunks of the given temporal extension.

  • target (SaQCFields | newSaQCFields , optional) – Name of the variable to which the results are written. If the variable does not exist, it will be created. Defaults to field.

  • dfilter (Any, optional) – Defines which observations are masked based on their existing flags. Any data point with a flag value greater than or equal to this threshold is passed to the function as numpy.nan. Defaults to the DFILTER_DEFAULT value of the active flagging scheme.

  • flag (Any, optional) – Flag value used to annotate detected observations. Defaults to the BAD value of the active flagging scheme.

  • start_date (pd.Timestamp | datetime.datetime | str, optional) – Lower temporal bound for function execution. Only observations with timestamps greater than or equal to start_date are processed. String inputs may be partially specified (e.g., '15:00', '01T12:00', '01-01') to restrict recurring temporal patterns.

  • end_date (pd.Timestamp | datetime.datetime | str, optional) – Upper temporal bound for function execution. Only observations with timestamps less than or equal to end_date are processed. String inputs may be partially specified to restrict recurring temporal patterns.

Returns:

SaQC – the updated SaQC object

Return type:

saqc.SaQC
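
Examples

A minimal sketch, assuming a synthetic daily series named 'data' (names, values, and the chosen transformations are illustrative only):

>>> import numpy as np
>>> import pandas as pd
>>> import saqc
>>> idx = pd.date_range('2000-01-01', freq='1D', periods=10)
>>> qc = saqc.SaQC(pd.Series(np.arange(10, dtype=float), index=idx, name='data'))
>>> qc = qc.transform('data', func=lambda s: (s - s.mean()) / s.std())            # z-score over the whole series
>>> qc = qc.transform('data', func='mean', freq='5D', target='data_chunk_mean')   # chunk-wise mean on 5-day segments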