
Outlier Detection#

This tutorial introduces a simple-to-use yet powerful method for clearing uniformly sampled, univariate data of global and local outliers as well as outlier clusters. To that end, we introduce the usage of the flagUniLOF() method, which is a modification of the established Local Outlier Factor (LOF) algorithm and can be applied without prior modelling of the data to be flagged.

Example Data Import#

We load the example data set from the saqc repository using the pandas CSV file reader. Subsequently, we cast the index of the imported data to DatetimeIndex <https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html>, initialize a SaQC instance with the imported data, and finally plot it via the built-in plot() method.

>>> import pandas as pd
>>> import saqc
>>> data = pd.read_csv('./resources/data/hydro_data.csv')
>>> data = data.set_index('Timestamp')
>>> data.index = pd.DatetimeIndex(data.index)
>>> qc = saqc.SaQC(data)
>>> qc.plot('sac254_raw')
../_images/OutlierDetection-2.png

Initial Flagging#

We start by applying the algorithm flagUniLOF() with default arguments, so the main calibration parameters n and thresh are set to 20 and 1.5 respectively.

For a detailed overview of all the parameters, as well as an introduction to how the algorithm works, see the documentation of flagUniLOF() itself.

>>> import saqc
>>> qc = qc.flagUniLOF('sac254_raw')
>>> qc.plot('sac254_raw') 
../_images/OutlierDetection-3.png

Flagging result with default parameter configuration.#

The result of this first attempt looks reasonable. Most of the obvious outliers have been flagged right away, and no inliers appear to have been falsely labeled. Zooming in on a three-month span in 2016, however, gives the impression that some less extreme outliers have passed flagUniLOF() undetected:

../_images/OutlierDetection-4.png

Assuming the flickering values in late September also qualify as outliers, we will see in the next section how to tune the algorithm to detect them.#

Tuning Threshold Parameter#

Of course, the result from applying flagUniLOF() with default parameter settings might not always meet the expectations.

The best way to tune the algorithm is to tweak one of the parameters thresh or n.

To tune thresh, find a value that slightly underflags the data, then reapply the function with successively lower values of thresh.

>>> qc = qc.flagUniLOF('sac254_raw', thresh=1.3, label='threshold = 1.3')
>>> qc.plot('sac254_raw') 
../_images/OutlierDetection-5.png

Result of applying flagUniLOF() again on top of the default-parameter result, this time with the thresh parameter set to 1.3.#

It seems we could sift out some more of the outlier-like, flickering values. Let's lower the threshold even more:

>>> qc = qc.flagUniLOF('sac254_raw', thresh=1.1, label='threshold = 1.1')
>>> qc.plot('sac254_raw') 
../_images/OutlierDetection-6.png

Even more values get flagged with thresh=1.1#

>>> qc = qc.flagUniLOF('sac254_raw', thresh=1.05, label='threshold = 1.05')
>>> qc.plot('sac254_raw') 
../_images/OutlierDetection-7.png

Result begins to look overflagged with thresh=1.05#

The lower bound for meaningful values of thresh is 1. With thresh=1, the method flags every data point.

>>> qc = qc.flagUniLOF('sac254_raw', thresh=1, label='threshold = 1')
>>> qc.plot('sac254_raw') 
../_images/OutlierDetection-8.png

Setting thresh=1 assigns a flag to all values.#

Iterating down to thresh=1.1 seems to give quite a good overall flagging result:

../_images/OutlierDetection-9.png

Overall the outlier detection with thresh=1.1 seems to work very well. Ideally of course, we would evaluate this result against a validated set of flags while tweaking the parameters.#
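The stepwise lowering of thresh carried out above can also be written as a loop. The following is a minimal sketch and not part of saqc: the helper name sweep_thresh and the guard against values of 1 or below are assumptions made for illustration. Only the flagUniLOF() keywords thresh, n and label are taken from the calls shown in this tutorial.

```python
def sweep_thresh(qc, field, thresholds, n=20):
    """Apply flagUniLOF() once per threshold, from the highest to the lowest.

    Hypothetical helper: `qc` is expected to behave like a SaQC instance,
    i.e. to offer flagUniLOF(field, thresh=..., n=..., label=...) and to
    return a new instance on every call.
    """
    ordered = sorted(thresholds, reverse=True)
    for thresh in ordered:
        # Assumption of this sketch: values <= 1 are rejected, because
        # thresh=1 already flags every data point.
        if thresh <= 1:
            raise ValueError(f"thresh must be greater than 1, got {thresh}")
    for thresh in ordered:
        # Each pass gets its own label, so the passes stay distinguishable
        # in the plot legend.
        qc = qc.flagUniLOF(field, thresh=thresh, n=n, label=f"threshold = {thresh}")
    return qc
```

With a SaQC instance, a call could look like qc = sweep_thresh(saqc.SaQC(data), 'sac254_raw', [1.5, 1.3, 1.1]). Plotting the result after each pass, as done above, remains the way to judge when the flagging starts to look overflagged.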

The plot shows some overflagging in the close vicinity of erratic data jumps. In the next section, we will see how to fine-tune the algorithm by shrinking the locality value n to make the process more robust in the surroundings of anomalies.

Before that, let's briefly check on the outlier cluster in March 2016, which was correctly flagged as well.

../_images/OutlierDetection-10.png

flagUniLOF() will reliably flag groups of outliers with fewer than n/2 periods.#

Tuning Locality Parameter#

The parameter n controls the number of nearest neighbors included in the LOF calculation. So n effectively determines the size of the “neighborhood” a data point is compared with in order to obtain its “outlierishness”.
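To make the role of the neighborhood size more tangible, the following pure-Python sketch computes the classic LOF score for a handful of values. It illustrates the established LOF algorithm only, not saqc's implementation: flagUniLOF() is a modification of LOF, and the function name lof_scores, the sample values and the use of the absolute difference as distance are assumptions of this sketch. The parameter k plays the role that n plays above.

```python
def lof_scores(values, k):
    """Classic LOF scores for 1-D values (quadratic runtime, illustration only)."""
    size = len(values)
    if not 1 <= k < size:
        raise ValueError("k must satisfy 1 <= k < len(values)")

    neighbors = []  # indices of the k nearest neighbors of every point
    k_dist = []     # distance of every point to its k-th nearest neighbor
    for i in range(size):
        others = sorted(
            (abs(values[i] - values[j]), j) for j in range(size) if j != i
        )
        nearest = others[:k]
        neighbors.append([j for _, j in nearest])
        k_dist.append(nearest[-1][0])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(size):
        reach = [max(k_dist[j], abs(values[i] - values[j])) for j in neighbors[i]]
        mean_reach = sum(reach) / k
        # Heavily duplicated values (mean_reach == 0) are not treated
        # properly by this sketch.
        lrd.append(1.0 / mean_reach if mean_reach > 0 else float("inf"))

    # LOF: mean density of the neighbors relative to the point's own density.
    return [
        (sum(lrd[j] for j in neighbors[i]) / k) / lrd[i] for i in range(size)
    ]


# Hypothetical sample: nine values close to 1 and one obvious outlier.
values = [1.0, 1.1, 0.9, 1.2, 1.05, 0.95, 1.15, 5.0, 0.85, 1.25]
scores = lof_scores(values, k=3)
flagged = [i for i, score in enumerate(scores) if score > 1.5]
print(flagged)  # -> [7], the position of the value 5.0
```

Points that sit in a neighborhood as dense as their own score close to 1, which is why thresholds only slightly above 1 already separate outliers from inliers.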

Smaller values of n can lead to clearer results, because feedback effects between normal points and outliers are mitigated:

>>> qc = saqc.SaQC(data)
>>> qc = qc.flagUniLOF('sac254_raw', thresh=1.5, n=8, label='thresh=1.5, n= 8')
>>> qc.plot('sac254_raw', xscope=slice('2016-09','2016-11')) 
../_images/OutlierDetection-11.png

Result with n=8 and thresh=1.5#

Since n determines the size of the surrounding a point is compared to, it also determines the maximal size of detectable outlier clusters. The group we were able to detect by applying flagUniLOF() with n=20 is not flagged with n=8:

../_images/OutlierDetection-12.png

A cluster with more than n/2 members will likely not be detected by the algorithm.#
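The n/2 rule of thumb can be turned into a quick calculation when choosing n for a known cluster size. The sketch below is plain arithmetic and not part of saqc. The helper names are hypothetical, and the 15-minute sampling interval is an assumed value for illustration, since the sampling rate of the example data is not stated here.

```python
from datetime import timedelta


def max_cluster_periods(n):
    """Largest cluster size (in periods) that is still fewer than n/2."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return (n - 1) // 2


def min_n_for_cluster(periods):
    """Smallest n for which a cluster of `periods` values is fewer than n/2."""
    return 2 * periods + 1


def max_cluster_duration(n, sampling_interval):
    """Approximate time covered by the largest reliably detectable cluster."""
    return max_cluster_periods(n) * sampling_interval


print(max_cluster_periods(20))  # 9
print(max_cluster_periods(8))   # 3
print(min_n_for_cluster(5))     # 11
# Assumed sampling interval of 15 minutes:
print(max_cluster_duration(20, timedelta(minutes=15)))  # 2:15:00
```

Since clusters above the limit are only described as likely to be missed, the numbers are a starting point for calibration rather than a hard guarantee.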

Also note that, when changing n, you usually have to recalibrate a good starting point for the thresh parameter as well.

Increasingly higher values of n make flagUniLOF() increasingly invariant to local variance and turn it into more of a global outlier detection function. So, one approach to clearing an entire time series of outliers is to start with a large n to remove global outliers first, and then fine-tune thresh for smaller values of n in a second application of the algorithm.

>>> qc = saqc.SaQC(data)
>>> qc = qc.flagUniLOF('sac254_raw', thresh=1.5, n=100, label='thresh=1.5, n=100')
>>> qc.plot('sac254_raw')
../_images/OutlierDetection-13.png
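The two-stage approach described above, a global pass with a large n followed by a local pass with a smaller n, could be sketched as follows. The helper name flag_global_then_local is hypothetical, and the defaults of the second pass (n=20, thresh=1.1) are merely the values that looked reasonable earlier in this tutorial, not a recommendation. As noted above, thresh usually needs recalibration whenever n changes.

```python
def flag_global_then_local(
    qc,
    field,
    n_global=100,
    thresh_global=1.5,
    n_local=20,
    thresh_local=1.1,
):
    """Run flagUniLOF() twice: first with a large n, then with a smaller one.

    Hypothetical helper: `qc` is expected to behave like a SaQC instance,
    i.e. to offer flagUniLOF(field, thresh=..., n=..., label=...) and to
    return a new instance on every call.
    """
    # First pass: a large neighborhood targets global outliers.
    qc = qc.flagUniLOF(
        field,
        thresh=thresh_global,
        n=n_global,
        label=f"thresh={thresh_global}, n={n_global}",
    )
    # Second pass: a smaller neighborhood targets local outliers.
    qc = qc.flagUniLOF(
        field,
        thresh=thresh_local,
        n=n_local,
        label=f"thresh={thresh_local}, n={n_local}",
    )
    return qc
```

A call such as qc = flag_global_then_local(saqc.SaQC(data), 'sac254_raw') followed by qc.plot('sac254_raw') would then show both passes under their own labels.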