What kind of techniques do you use to deal with non-representative data in your timeseries models?

  • 13 April 2022
  • 1 reply

Most timeseries models assume that (conditional on the model) the “future will resemble the past”. In slightly more specific terms, this involves assuming that the “properties” of the training data relevant to the model will be the same as those encountered by the model during production (or equivalently that “the data generating mechanism” won’t change in ways relevant to the model between encountering training and production data).

An important preprocessing step when performing any kind of timeseries modelling is consequently to handle time periods in the training data which are in some relevant way “not representative”. This could mean removing certain dates from the data entirely, using indicator variables to help the model distinguish these non-representative periods from the rest of the data, or replacing the affected values using some kind of imputation technique.
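To make the three options concrete, here is a minimal sketch in NumPy. The toy series, the `abnormal` mask and the linear-trend regression are all made up for illustration, and linear interpolation is only one of many possible imputation choices.

```python
import numpy as np

# Hypothetical toy data: 10 daily observations, with days 4-6 flagged
# (by assumption) as non-representative.
t = np.arange(10, dtype=float)
y = np.array([10.0, 11.0, 12.0, 13.0, 40.0, 42.0, 41.0, 17.0, 18.0, 19.0])
abnormal = np.zeros(10, dtype=bool)
abnormal[4:7] = True

# Option 1: remove the non-representative dates entirely.
t_kept, y_kept = t[~abnormal], y[~abnormal]

# Option 2: keep every point, but add an indicator variable as an extra
# regressor so a linear model can separate the abnormal level from the trend.
X = np.column_stack([np.ones_like(t), t, abnormal.astype(float)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # [intercept, slope, indicator effect]

# Option 3: impute the abnormal values, here by linear interpolation
# between the neighbouring representative points.
y_imputed = y.copy()
y_imputed[abnormal] = np.interp(t[abnormal], t[~abnormal], y[~abnormal])
```

Note that option 1 leaves gaps in the time index, which only some models tolerate, while options 2 and 3 keep the series regularly spaced.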

The most appropriate way of dealing with such non-representative data will depend on the structure of the data itself (sampling frequency, distribution of values, etc.), the choice of modelling approach or algorithm (for example whether predictions are required or not), and (perhaps most importantly) the types of decisions the model is intended to support.

What kind of techniques have you used to deal with non-representative time series data? Did you try anything else? What made that technique work the best? How did you assess what “best” meant?

1 reply

To contribute a recent experience of my own:

I’ve recently had some success ignoring “abnormal” periods of a particular (univariate) timeseries and using Gaussian process (GP) regression to do the forecasting.

Since the GP doesn’t care about having equally spaced sample points, the unrepresentative periods could just be omitted. This somehow felt “more clean” than trying to salvage the data from those periods. Other motivating factors were that we needed a probabilistic forecast and there were relatively few sample points.
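For anyone who wants to try this, the idea can be sketched roughly as follows. This is not the code from my project: it is a bare-bones GP with an RBF kernel written in plain NumPy, with fixed, hand-picked hyperparameters (`length_scale`, `variance`, `noise` are assumptions) and a synthetic series. In practice you would more likely use a library such as scikit-learn's `GaussianProcessRegressor` and fit the kernel hyperparameters.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=2.0, variance=1.0):
    """Squared-exponential covariance between two sets of time points."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(t_train, y_train, t_test, noise=0.1, **kernel_args):
    """Posterior mean and std of the latent function at t_test.

    Uses a constant prior mean equal to the training mean; `noise` is the
    assumed observation noise standard deviation.
    """
    prior_mean = y_train.mean()
    K = rbf_kernel(t_train, t_train, **kernel_args) + noise**2 * np.eye(len(t_train))
    K_s = rbf_kernel(t_train, t_test, **kernel_args)
    K_ss = rbf_kernel(t_test, t_test, **kernel_args)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - prior_mean))
    v = np.linalg.solve(L, K_s)
    pred_mean = prior_mean + K_s.T @ alpha
    pred_var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return pred_mean, np.sqrt(np.maximum(pred_var, 0.0))

# Synthetic example: the "abnormal" period (t = 10..15) is simply left out,
# so the training inputs are no longer equally spaced.
t = np.arange(30, dtype=float)
y = np.sin(2 * np.pi * t / 12.0)
abnormal = (t >= 10) & (t < 16)
t_train, y_train = t[~abnormal], y[~abnormal]

# Predict at an observed time, inside the omitted gap, and just past the data.
t_test = np.array([5.0, 13.0, 31.0])
mean, std = gp_predict(t_train, y_train, t_test)
```

The predictive standard deviation grows inside the omitted period and beyond the end of the data, which is the kind of quantified uncertainty we were after.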

“Success” here meant having a robust forecast with quantified predictive uncertainty that did slightly better in terms of MAE than a “seasonally naive” approach (i.e. using the value from the same time of the last year that had “representative” data as the forecast).
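The baseline comparison is simple enough to sketch in a few lines of plain Python. The function names, the quarterly layout and all the numbers below are hypothetical and only there to show the mechanics, not results from the actual project.

```python
def seasonal_naive_forecast(values_by_year, representative_years, target_year):
    """Forecast each season of target_year with the value from the same
    season of the most recent earlier year flagged as representative."""
    candidates = [
        year for year in representative_years
        if year < target_year and year in values_by_year
    ]
    if not candidates:
        raise ValueError("no representative year available before target_year")
    return list(values_by_year[max(candidates)])

def mean_absolute_error(actual, forecast):
    """Average absolute difference between actuals and forecasts."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical quarterly data; 2020 is treated as non-representative.
values_by_year = {
    2019: [100.0, 120.0, 140.0, 110.0],
    2020: [60.0, 30.0, 50.0, 70.0],
    2021: [105.0, 118.0, 150.0, 112.0],
}
representative_years = {2019}

forecast_2021 = seasonal_naive_forecast(values_by_year, representative_years, 2021)
baseline_mae = mean_absolute_error(values_by_year[2021], forecast_2021)
```

Here the forecast for 2021 skips 2020 and reuses the 2019 values; the MAE of any candidate model on the same hold-out period can then be compared against `baseline_mae`.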