Calibrating Pipelines

This tutorial introduces the calibration of flagging and data-filtering pipelines composed of SaQC methods.

Data Import

Load the example data set from the saqc repository using the pandas CSV file reader. Then, create an SaQC instance from the data and generate a plot using the plot() method.

>>> import pandas as pd
>>> import saqc
>>> data = pd.read_csv('./resources/data/corruptedTemperature.csv', index_col=0, parse_dates=[0])
>>> qc = saqc.SaQC(data)
>>> qc.plot(['temp0', 'temp1'], mode='subplots')
../_images/CalibratingPipelines-2.png
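To see what index_col=0 and parse_dates=[0] do without the example file at hand, here is a minimal sketch that reads a tiny in-memory CSV. The column names temp0 and temp1 match the fields plotted above; the header name timestamp and all values are made up for illustration and are not taken from corruptedTemperature.csv.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt; the real file's header and values may differ.
csv_text = """timestamp,temp0,temp1
2017-01-01 00:00:00,1.5,2.0
2017-01-01 00:10:00,1.6,2.1
2017-01-01 00:20:00,30.0,2.2
"""

# index_col=0 turns the first column into the index;
# parse_dates=[0] converts that column to datetimes.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=[0])

print(type(sample.index).__name__)
print(list(sample.columns))
```

A datetime index is what lets the later calls select a time window through start_date and end_date.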

Calibrate Single Type Pipeline

To calibrate a pipeline that specifically targets outliers in the data, we need examples of outliers on which the pipeline can be trained. These examples can be marked interactively in the supervision GUI, which is opened below with flagByClick().

We don't need to work through the whole year of available data. Identifying outliers in January should suffice.

So start_date and end_date are set to 2017-01-01 and 2017-02-01. Because we are identifying outliers, we pass 'outliers' to the label keyword, which names the assignment.

>>> qc = qc.flagByClick('temp0', gui_mode='overlay', label='outliers', start_date='2017-01-01', end_date='2017-02-01')
../_images/CalibratingPipelines-3.png

Fig. 24 Supervision GUI.

Values can be added to the selection of calibration targets with the rectangle selector: right-click, hold, and drag over the values that look like outliers. Doing the same with a left-click removes points from the selection again:

../_images/supervisionGUItemp0FlagsAssigned.png

Clicking the Assign Flags button confirms the assignment and carries it out. A flagging method can now be calibrated against that assignment. Calibration should be confined to the same time frame as supervision, which is why we pass the same values to start_date and end_date as above.

Since we labeled the calibration targets 'outliers', the parameter problem_labels is assigned the single-element list ['outliers'].

The problems parameter takes a list that determines which methods are calibrated to the targets. Since we are targeting values that exceed the local scatter of the data (outliers), we pass the outlier calibration 'outliers'.

Finally, we give the resulting calibrated function a name via the name parameter.

>>> qc.calibratePipeline('temp0', name='calibratedOutlierDetector', problems=['outliers'], problem_labels=['outliers'], start_date='2017-01-01', end_date='2017-02-01')
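Conceptually, calibrating a detector against supervised examples means searching its parameters for the setting that best reproduces the labels. The following is a minimal sketch of that idea in plain Python, not SaQC's actual calibration routine: the detector (deviation from the median in units of the median absolute deviation), the threshold grid, and the F1 objective are all assumptions chosen for illustration.

```python
from statistics import median


def mad_scores(values):
    """Distance of each value from the median, in units of the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1.0  # guard against a zero MAD
    return [abs(v - med) / mad for v in values]


def f1(predicted, labeled):
    """F1 score between two sets of flagged positions."""
    true_pos = len(predicted & labeled)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(labeled)
    return 2 * precision * recall / (precision + recall)


def calibrate_threshold(values, labeled, grid):
    """Return the (threshold, score) pair from `grid` that best reproduces the labels."""
    scores = mad_scores(values)
    best = None
    for thresh in grid:
        flagged = {i for i, s in enumerate(scores) if s > thresh}
        score = f1(flagged, labeled)
        if best is None or score > best[1]:  # ties keep the smaller threshold
            best = (thresh, score)
    return best


# Toy series with two obvious spikes at positions 3 and 8 (the "supervised" labels).
values = [1.0, 1.1, 0.9, 9.0, 1.0, 1.2, 0.8, 1.1, -7.0, 1.0]
labeled = {3, 8}
threshold, score = calibrate_threshold(values, labeled, grid=[0.5, 1.5, 3.0, 6.0])
print(threshold, score)
```

In this toy setup, small thresholds also flag ordinary scatter and lose precision, so the search settles on the first threshold that isolates the two labeled spikes.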

Once the calibration is complete, the calibrated function can be called like any other method, under the name passed to name. We apply the calibrated function and plot the flagging result:

>>> qc.calibratedOutlierDetector('temp0').plot('temp0')
../_images/CalibratingPipelines-4.png

To make the calibrated function available in future sessions, or to integrate it into automated pipeline setups, the optimal parameters and a configuration file can be logged to a folder by passing its path to the parameter log_path (PATH below is a placeholder for a writable directory). Let's do this for the second variable. As this variable hasn't been supervised yet, the supervision GUI will be called on the fly, so we can skip the separate supervision call.

>>> qc.calibratePipeline('temp1', problems=['outliers'], problem_labels=['outliers'], start_date='2017-01-01', end_date='2017-02-01', log_path=PATH)

The configuration file is stored at "PATH/config.csv" and can be loaded (and applied to a field) via the applyConfig method:

>>> qc.applyConfig('temp1', path=PATH + '/config.csv').plot('temp1')
../_images/CalibratingPipelines-5.png
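PATH is not defined in the snippets above; it stands for any writable directory. The following is a minimal sketch, assuming a temporary directory is acceptable as the log folder, of how such a directory and the location of the configuration file could be built with the standard library. The file name config.csv follows the text above; nothing here calls SaQC.

```python
import tempfile
from pathlib import Path

# Assumption: any writable folder works as the log directory; a temp dir is used here.
PATH = tempfile.mkdtemp(prefix="saqc_calibration_")

# Joining with pathlib avoids hand-written separators such as PATH + '/config.csv'.
config_file = Path(PATH) / "config.csv"

print(config_file.name)
print(config_file.parent.is_dir())
```

On POSIX systems, str(config_file) yields the same string as PATH + '/config.csv'.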

To calibrate a pipeline that targets both the outliers and the noise, we can add flags with a 'noise' label:

>>> qc = qc.flagByClick('temp0', label='noise', gui_mode='overlay', start_date='2017-01-01', end_date='2017-02-01')
../_images/AdditionalSelection.png

To run a pipeline that catches both anomaly types, we sequentially calibrate a noise problem, targeting the 'noise'-labeled flags, and then the outliers problem, targeting the 'outliers'-labeled data.

This could, of course, be done with two consecutive calls:

  1. qc.calibratePipeline('temp0', problems=['noise'], problem_labels=['noise'], start_date='2017-01-01', end_date='2017-02-01', name='noiseFilter')

  2. qc.calibratePipeline('temp0', problems=['outliers'], problem_labels=['outliers'], start_date='2017-01-01', end_date='2017-02-01', log_path=PATH, name='outlierFilter')

This would result in two new methods, 'noiseFilter' and 'outlierFilter'. To do the calibration in one call, and to generate a single method and configuration file that targets both anomaly types in one go, we can instead do:

>>> qc.calibratePipeline('temp0', name='anomalyDetector', problems=['noise', 'outliers'], problem_labels=['noise','outliers'], start_date='2017-01-01', end_date='2017-02-01', log_path=PATH)

Afterwards, calling the newly generated method 'anomalyDetector' on any field, for example qc.anomalyDetector('temp0'), filters it for both of the exemplified anomaly patterns.
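The sequential idea behind a two-problem pipeline can be sketched without SaQC: stages run in order, each stage only inspects values that earlier stages left unflagged, and the flags accumulate under their labels. The two detector functions, their thresholds, and the baseline below are hypothetical stand-ins, not what calibratePipeline produces.

```python
def run_stages(values, stages):
    """Apply labeled detectors in order; later stages skip already flagged points."""
    flags = {}
    for label, detector in stages:
        for i, v in enumerate(values):
            if i not in flags and detector(v):
                flags[i] = label
    return flags


BASELINE = 1.0  # assumed typical level of the toy series


def is_noise(v):
    # Hypothetical rule: moderate scatter around the baseline.
    return 0.3 < abs(v - BASELINE) <= 2.0


def is_outlier(v):
    # Hypothetical rule: large excursions from the baseline.
    return abs(v - BASELINE) > 2.0


values = [1.0, 1.1, 1.6, 0.5, 1.0, 9.0, 1.0, 0.9]
# Same order as in the text: noise first, then outliers.
flags = run_stages(values, [("noise", is_noise), ("outliers", is_outlier)])
print(flags)
```

Keeping the label with each flag mirrors the role of problem_labels above: every stage can be traced back to the kind of anomaly it was meant to catch.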