.. SPDX-FileCopyrightText: 2021 Helmholtz-Zentrum für Umweltforschung GmbH - UFZ .. .. SPDX-License-Identifier: GPL-3.0-or-later Outlier Detection ================= The tutorial aims to introduce into a simple to use, jet powerful method for clearing :ref:`uniformly sampled `, *univariate* data, from global und local outliers as well as outlier clusters. Therefor, we will introduce into the usage of the :py:meth:`~saqc.SaQC.flagUniLOF` method, which represents a modification of the established `Local Outlier Factor `_ (LOF) algorithm and is applicable without prior modelling of the data to flag. * :ref:`Example Data Import ` * :ref:`Initial Flagging ` * :ref:`Tuning Threshold Parameter ` * :ref:`Tuning Locality Parameter ` Example Data Import ------------------- .. plot:: :context: reset :include-source: False import matplotlib import saqc import pandas as pd data = pd.read_csv('../resources/data/hydro_data.csv') data = data.set_index('Timestamp') data.index = pd.DatetimeIndex(data.index) qc = saqc.SaQC(data) We load the example `data set `_ from the *saqc* repository using the `pandas `_ csv file reader. Subsequently, we cast the index of the imported data to `DatetimeIndex `, then initialize a :py:class:`~saqc.SaQC` instance using the imported data and finally we plot it via the built-in :py:meth:`~saqc.SaQC.plot` method. .. doctest:: flagUniLOFExample >>> import saqc >>> data = pd.read_csv('./resources/data/hydro_data.csv') >>> data = data.set_index('Timestamp') >>> data.index = pd.DatetimeIndex(data.index) >>> qc = saqc.SaQC(data) >>> qc.plot('sac254_raw') # doctest: +SKIP .. plot:: :context: :include-source: False :class: center qc.plot('sac254_raw') Initial Flagging ---------------- We start by applying the algorithm :py:meth:`~saqc.SaQC.flagUniLOF` with default arguments, so the main calibration parameters :py:attr:`n` and :py:attr:`thresh` are set to `20` and `1.5` respectively. For an detailed overview over all the parameters, as well as an introduction into the working of the algorithm, see the documentation of :py:meth:`~saqc.SaQC.flagUniLOF` itself. .. doctest:: flagUniLOFExample >>> import saqc >>> qc = qc.flagUniLOF('sac254_raw') >>> qc.plot('sac254_raw') # doctest: +SKIP .. plot:: :context: close-figs :include-source: False :class: center :caption: Flagging result with default parameter configuration. qc = qc.flagUniLOF('sac254_raw') qc.plot('sac254_raw') The results from that initial shot seem to look not too bad. Most instances of obvious outliers seem to have been flagged right away and there seem to be no instances of inliers having been falsely labeled. Zooming in onto a 3 months strip on *2016*, gives the impression of some not so extreme outliers having passed :py:meth:`~saqc.SaQC.flagUniLOF` undetected: .. plot:: :context: close-figs :include-source: False :class: centers :caption: Assuming the flickering values in late september also qualify as outliers, we will see how to tune the algorithm to detect those in the next section. qc.plot('sac254_raw', xscope=slice('2016-09','2016-11')) Tuning Threshold Parameter -------------------------- Of course, the result from applying :py:meth:`~saqc.SaQC.flagUniLOF` with default parameter settings might not always meet the expectations. The best way to tune the algorithm, is, by tweaking one of the parameters :py:attr:`thresh` or :py:attr:`n`. To tune :py:attr:`thresh`, find a value that slightly underflags the data, and *reapply* the function with evermore decreased values of :py:attr:`thresh`. .. doctest:: flagUniLOFExample >>> qc = qc.flagUniLOF('sac254_raw', thresh=1.3, label='threshold = 1.3') >>> qc.plot('sac254_raw') # doctest: +SKIP .. plot:: :context: close-figs :include-source: False :class: center :caption: Result from applying :py:meth:`~saqc.SaQC.flagUniLOF` again on the results for default parameter configuration, this time setting :py:attr:`thresh` parameter to *1.3*. qc = qc.flagUniLOF('sac254_raw', thresh=1.3, label='threshold=1.3') qc.plot('sac254_raw', xscope=slice('2016-09','2016-11')) It seems we could sift out some more of the outlier like, flickering values. Lets lower the threshold even more: .. doctest:: flagUniLOFExample >>> qc = qc.flagUniLOF('sac254_raw', thresh=1.1, label='threshold = 1.1') >>> qc.plot('sac254_raw') # doctest: +SKIP .. plot:: :context: close-figs :include-source: False :class: center :caption: Even more values get flagged with :py:attr:`thresh=1.1` qc = qc.flagUniLOF('sac254_raw', thresh=1.1, label='thresh=1.1') qc.plot('sac254_raw', xscope=slice('2016-09','2016-11')) .. doctest:: flagUniLOFExample >>> qc = qc.flagUniLOF('sac254_raw', thresh=1.05, label='threshold = 1.05') >>> qc.plot('sac254_raw') # doctest: +SKIP .. plot:: :context: close-figs :include-source: False :class: center :caption: Result begins to look overflagged with :py:attr:`thresh=1.05` qc = qc.flagUniLOF('sac254_raw', thresh=1.05, label='thresh=1.05') qc.plot('sac254_raw', xscope=slice('2016-09','2016-11')) The lower bound for meaningful values of :py:attr:`thresh` is *1*. With threshold *1*, the method labels every data point. .. doctest:: flagUniLOFExample >>> qc = qc.flagUniLOF('sac254_raw', thresh=1, label='threshold = 1') >>> qc.plot('sac254_raw') # doctest: +SKIP .. plot:: :context: close-figs :include-source: False :class: center :caption: Setting :py:attr:`thresh=1` will assign flag to all the values. qc = qc.flagUniLOF('sac254_raw', thresh=1, label='thresh=1') qc.plot('sac254_raw', xscope=slice('2016-09','2016-11')) Iterating until `1.1`, seems to give quite a good overall flagging result: .. plot:: :context: close-figs :include-source: False :class: center :caption: Overall the outlier detection with :py:attr:`thresh=1.1` seems to work very well. Ideally of course, we would evaluate this result against a validated set of flags while tweaking the parameters. qc = saqc.SaQC(data) qc = qc.flagUniLOF('sac254_raw', thresh=1.5, label='thresh=1.5') qc = qc.flagUniLOF('sac254_raw', thresh=1.3, label='thresh=1.3') qc = qc.flagUniLOF('sac254_raw', thresh=1.1, label='thresh=1.1') qc.plot('sac254_raw') The plot shows some over flagging in the closer vicinity of erratic data jumps. We will see in the next section, how to fine-tune the algorithm by shrinking the locality value :py:attr:`n` to make the process more robust in the surroundings of anomalies. Before this, lets briefly check on this outlier cluster, at march 2016, that got correctly flagged, as well. .. plot:: :context: close-figs :include-source: False :class: center :caption: :py:meth:`~saqc.SaQC.flagUniLOF` will reliably flag groups of outliers, with less than :py:attr:`n/2` periods. qc.plot('sac254_raw', xscope=slice('2016-03-15','2016-03-17')) Tuning Locality Parameter ------------------------- The parameter :py:attr:`n` controls the number of nearest neighbors included into the LOF calculation. So :py:attr:`n` effectively determines the size of the "neighborhood", a data point is compared with, in order to obtain its "outlierishnes". Smaller values of :py:attr:`n` can lead to clearer results, because of feedback effects between normal points and outliers getting mitigated: .. doctest:: flagUniLOFExample >>> qc = saqc.SaQC(data) >>> qc = qc.flagUniLOF('sac254_raw', thresh=1.5, n=8, label='thresh=1.5, n= 8') >>> qc.plot('sac254_raw', xscope=slice('2016-09','2016-11')) # doctest: +SKIP .. plot:: :context: close-figs :include-source: False :class: center :caption: Result with :py:attr:`n=8` and :py:attr:`thresh=20` qc = saqc.SaQC(data) qc = qc.flagUniLOF('sac254_raw', n=8) qc.plot('sac254_raw', xscope=slice('2016-09','2016-11')) Since :py:attr:`n` determines the size of the surrounding, a point is compared to, it also determines the maximal size of detectable outlier clusters. The group we were able to detect by applying :py:meth:`~saqc.SaQC.flagUniLOF` with :py:attr:`n=20`, is not flagged with :py:attr:`n=8`: .. plot:: :context: close-figs :include-source: False :class: center :caption: A cluster with more than :py:attr:`n/2` members, will likely not be detected by the algorithm. qc.plot('sac254_raw', xscope=slice('2016-03-15','2016-03-17')) Also note, that, when changing :py:attr:`n`, you usually have to restart calibrating a good starting point for the py:attr:`thresh` parameter as well. Increasingly higher values of :py:attr:`n` will make :py:meth:`~saqc.SaQC.flagUniLOF` increasingly invariant to local variance and make it more of a global outlier detection function. So, an approach towards clearing an entire timeseries from outliers is to start with large :py:attr:`n` to clear the data from global outliers first, before fine-tuning :py:attr:`thresh` for smaller values of :py:attr:`n` in a second application of the algorithm. .. doctest:: flagUniLOFExample >>> qc = saqc.SaQC(data) >>> qc = qc.flagUniLOF('sac254_raw', thresh=1.5, n=100, label='thresh=1.5, n=100') >>> qc.plot('sac254_raw')# doctest: +SKIP .. plot:: :context: close-figs :include-source: False :class: center qc = saqc.SaQC(data) qc = qc.flagUniLOF('sac254_raw', thresh=1.5, n=100, label='thresh=1.5, n=100') qc.plot('sac254_raw')