PDA

View Full Version : Statistics: stationarity for ARIMA models in time series



JeenLeen
2017-12-04, 12:21 AM
I'm taking a class on time series, and one thing that is emphasized a lot is obtaining a stationary model, that is, a model with a mean, variance, and covariance that is constant over time (or at least, covariance is constant for any given lag regardless of where in the time series the two points are).
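A quick way to see that definition in actual numbers (a rough sketch in plain Python, standard library only — not R, and all names here are invented for illustration): white noise has the same variance no matter when you look at it, while a random walk's variance grows with time, so the walk is non-stationary by the definition above.

```python
import random
import statistics

random.seed(42)

def walk_value(steps):
    """Value of a random walk (cumulative sum of N(0,1) shocks) after `steps` steps."""
    return sum(random.gauss(0, 1) for _ in range(steps))

# Simulate many independent walks and measure the variance of the process
# at two different times. For a stationary process this variance would be
# constant; for a random walk it grows roughly linearly in t.
reps = 2000
var_t10 = statistics.variance(walk_value(10) for _ in range(reps))
var_t100 = statistics.variance(walk_value(100) for _ in range(reps))

print(f"Var at t=10:  {var_t10:.1f}")   # near 10
print(f"Var at t=100: {var_t100:.1f}")  # near 100
```

White noise by contrast would print roughly the same variance at both times, which is the "constant variance over time" part of the definition.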

I just don't get why stationarity matters. When I try to use R's arima() or sarima() functions to model the data, it seems to work fine regardless of whether the data are stationary or not. I can take differences or try detrending to get stationary data, but I don't see the point.

The professor at one point said that ARMA models do assume stationarity, but my models seem to work fine (i.e., normally distributed & independent residuals) even without any differencing (i.e., no 'I' in the ARIMA). I'm guessing the lack of it is invalidating my models, similar to how trying linear regression on data with no linear relationship is invalid... but I don't see how.

Can anyone explain why stationarity is important?

NichG
2017-12-04, 01:31 AM
Let's say you're predicting how many people who view a given advertisement will buy the product in order to decide how much money to allocate to different advertisement channels.

A model based on January to March is going to blithely assume that e.g. ski gear is a fairly high demand product. Then summer hits, and predictions made in that period have a systematic bias.

Whereas if you use data from the whole year, it behaves closer to a stationary (but more complex) process. You might now underestimate ski gear in the winter and overestimate it in the summer, but the average shouldn't be biased. Furthermore by conditioning the model on the time of year it could actually incorporate those variations.

A major signature of non-stationarity is a model that appears to get worse over time. If you aren't explicitly testing for that on your data, the model may look like it's working now but will get worse in the future.

Some level of non-stationarity is generally to be expected, so it's a thing you live with rather than necessarily a reason to not try a model in the first place. But seasonality for example is a big effect.
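The ski-gear scenario above can be sketched numerically. This is a toy simulation (the demand function, the cosine seasonal cycle, and every number in it are made up): a constant forecast fitted on January-March data is badly biased in summer, while one fitted on the whole year is unbiased on average, even though it misses the seasonal swings.

```python
import math
import random

random.seed(1)

def demand(month):
    """Toy monthly ski-gear demand: cosine seasonal cycle peaking in winter, plus noise."""
    seasonal = 50 * math.cos(2 * math.pi * (month - 1) / 12)  # high in Jan, low in Jul
    return 100 + seasonal + random.gauss(0, 5)

# Three years of monthly data as (month, demand) pairs.
data = [(m % 12 + 1, demand(m % 12 + 1)) for m in range(36)]

# "Winter model": constant forecast fitted on Jan-Mar only.
winter_obs = [d for month, d in data if month <= 3]
winter_forecast = sum(winter_obs) / len(winter_obs)

# "Full-year model": constant forecast fitted on all months.
full_forecast = sum(d for _, d in data) / len(data)

summer = [d for month, d in data if month in (6, 7, 8)]
summer_mean = sum(summer) / len(summer)

print(f"winter-fit forecast: {winter_forecast:.0f}")  # fitted on high-demand months
print(f"full-year forecast:  {full_forecast:.0f}")    # unbiased overall average
print(f"actual summer mean:  {summer_mean:.0f}")      # far below the winter fit
```

The winter-fit forecast overshoots summer demand by a wide, systematic margin — the "systematic bias" described above — while the full-year forecast sits at the overall mean.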

JeenLeen
2017-12-04, 09:01 AM
Let's say you're predicting how many people who view a given advertisement will buy the product in order to decide how much money to allocate to different advertisement channels.

A model based on January to March is going to blithely assume that e.g. ski gear is a fairly high demand product. Then summer hits, and predictions made in that period have a systematic bias.

Whereas if you use data from the whole year, it behaves closer to a stationary (but more complex) process. You might now underestimate ski gear in the winter and overestimate it in the summer, but the average shouldn't be biased. Furthermore by conditioning the model on the time of year it could actually incorporate those variations.

A major signature of non-stationarity is a model that appears to get worse over time. If you aren't explicitly testing for that on your data, the model may look like it's working now but will get worse in the future.

Some level of non-stationarity is generally to be expected, so it's a thing you live with rather than necessarily a reason to not try a model in the first place. But seasonality for example is a big effect.

In seasonal models, I generally do a regression on both the time trend (i.e., just time, to see if there's an element generally increasing or decreasing over time) and by month (to capture how each month impacts things). I haven't really had to do ARIMA processes alongside seasonal models much, but I've tested for them--just not found them useful thus far for modeling what I'm modeling.

BUT a model with seasonal aspects doesn't usually look stationary, even if you have multiple years. The mean may be higher in the winter than in the summer each year, to use your example. We explain that with seasonality, but so far I've never needed to do something to the model (like differencing) to get rid of the seasonal non-stationarity. So I'm modeling it fine even with the non-stationarity.

...re-reading your post, maybe I'm misunderstanding where stationarity is needed. I've been thinking that I've been told the raw data needs to be stationary. Is it instead that the model needs to be stationary, which I guess is mainly seen by checking the residuals?
That makes more sense when talking of seasonality, since I can't imagine a model with relevant and strong seasonality having a constant mean, as the seasonality itself disrupts the constancy of the mean within every given year.

(As a side note, I do generally try log-transforming the data or a Box-Cox transformation to see if that helps stationarity, or at least constant variance. And with non-seasonal models I tend to do differencing a lot. It just hasn't seemed really helpful with seasonal models thus far. Also, after I difference the data, I'm not really sure what each time point means, so I feel a bit uncomfortable working with that method.)
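On what each differenced point means: it's simply the period-over-period change, diff[t] = x[t] - x[t-1]. A rough Python sketch on toy data (everything here is invented for illustration) shows why that helps — a series with a linear trend has a mean that drifts upward, but its first differences have a constant mean equal to the slope:

```python
import random
import statistics

random.seed(7)

# Series with a linear trend (slope 0.5): its mean rises over time, so it's
# non-stationary in the mean.
trended = [0.5 * t + random.gauss(0, 1) for t in range(400)]

# First difference: diff[t] = x[t] - x[t-1], i.e. the change from one period
# to the next. The trend becomes a roughly constant mean of about 0.5.
diff = [b - a for a, b in zip(trended, trended[1:])]

first_half, second_half = diff[:200], diff[200:]
print(f"trended means: {statistics.mean(trended[:200]):.1f} vs {statistics.mean(trended[200:]):.1f}")
print(f"diff means:    {statistics.mean(first_half):.2f} vs {statistics.mean(second_half):.2f}")
```

The raw series' mean jumps by about 100 between halves; the differenced series' mean stays near 0.5 in both halves, which is the stationarity that differencing buys.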

NichG
2017-12-04, 09:36 AM
There are two ways things can break. One is if you assume something about the data that isn't true, and as a result the quality of your predictions drops. The other is if the convergence of the method relies on that assumption being true, so that if it's violated the model won't even converge. Most of the time, it's the former rather than the latter.

There are however some models where something like the assumption of stationarity would be needed to prove that e.g. parameter estimates converge like 1/sqrt(N), where if the data is nonstationary they could remain at O(1) away from true values even for an infinite time series. This is usually most severe in models where the parameters are inferred in an online manner, rather than optimization-based models where you are optimizing against the total knowledge so far (in which case eventually everything starts to look stationary). I don't know if ARIMA is one of those that goes unstable, but I suppose it could be. It wouldn't likely manifest as the thing blowing up, but would rather be that beyond a certain amount of data the parameters don't converge to a single value even when they should.

JeenLeen
2017-12-04, 10:13 AM
Honestly, some of what you said went over my head, but I know I've sometimes gotten an error about being unable to fit a model, and I think it was due to either non-convergence or there being too many possible values for the computer to choose one. (Or does that mean the same thing?)

So, in a seasonal model that is non-stationary due to a different mean in some months than others: does it just need to be 'stationary enough' that convergence can occur without yielding an invalid model or just not getting a result?
If the former (model generated, but it is invalid), I reckon checking the residuals shows that the model is junk.
If the latter (no model generated), the computer fails to generate a model so no risk of using an invalid model.

NichG
2017-12-05, 06:59 AM
With non-stationarity, even if the residuals look good, they may become worse in the future. That is to say, non-stationarity can lead you to mis-estimate how bad your model actually is.

Models aren't generally okay or not. Every model is wrong, but it may still be useful. However, you need to evaluate that, and that evaluation usually relies on some assumptions about how the data behaves and how the model behaves.

Basically, there's a lot more to validation than just checking the residuals on the training data. You generally want some kind of hold out set to check against, that is related to the training data in a representative way of how the model is actually going to be used.
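A minimal sketch of that kind of hold-out check, in plain Python on invented toy data (two deliberately naive constant forecasts stand in for real models): fit on everything but the last 10 points, then score each model only on those held-out points.

```python
import random
import statistics

random.seed(3)

# Toy series with an upward trend that a naive "historical mean" model misses.
series = [0.3 * t + random.gauss(0, 2) for t in range(110)]

# Hold out the last 10 points; fit on the rest.
train, holdout = series[:100], series[100:]

# Model 1: forecast the historical mean (ignores the trend entirely).
mean_forecast = statistics.mean(train)
# Model 2: naive "last value" forecast (tracks the current level, not the trend).
last_forecast = train[-1]

def mae(forecast, actual):
    """Mean absolute error of a constant forecast against the holdout."""
    return sum(abs(forecast - a) for a in actual) / len(actual)

print(f"historical-mean MAE: {mae(mean_forecast, holdout):.1f}")
print(f"last-value MAE:      {mae(last_forecast, holdout):.1f}")
```

Both models can have fine-looking in-sample residuals early on; the holdout is what exposes that the historical-mean model is badly biased once the trend has moved the series away from its old average.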

JeenLeen
2017-12-05, 09:06 AM
With non-stationarity, even if the residuals look good, they may become worse in the future. That is to say, non-stationarity can lead you to mis-estimate how bad your model actually is.

Thanks! I think this sentence made it really click for me.


Models aren't generally okay or not. Every model is wrong, but it may still be useful. However, you need to evaluate that, and that evaluation usually relies on some assumptions about how the data behaves and how the model behaves.

Basically, there's a lot more to validation than just checking the residuals on the training data. You generally want some kind of hold out set to check against, that is related to the training data in a representative way of how the model is actually going to be used.

That makes sense. For one project, I'm leaving the last 5 or 10 data points I collected out of the data used to generate the model, and then I plan to compare the model's predictions to what actually happened. I see now why that's an important test.

So, in essence, stationarity is one of the assumptions of time series models (presumably including ARIMA models) that is required for valid forecasting.