
Simple Time Series Forecasting Models to Test So That You Don't Fool Yourself

It is important to establish a strong baseline of performance on a time series forecasting problem and to not fool yourself into thinking that sophisticated methods are skillful, when in fact they are not.

This requires that you evaluate a suite of standard naive, or simple, time series forecasting models to get an idea of the worst acceptable performance on the problem for more sophisticated models to beat.

Applying these simple models can also uncover new ideas about more advanced methods that may result in better performance.

In this tutorial, you will discover how to implement and automate three standard baseline time series forecasting methods on a real world dataset.

Specifically, you will learn:

  • How to automate the persistence model and test a suite of persisted values.
  • How to automate the expanding window model.
  • How to automate the rolling window forecast model and test a suite of window sizes.

This is an important topic and highly recommended for any time series forecasting project.

Let’s get started.

Overview

This tutorial is broken down into the following 5 parts:

  1. Monthly Car Sales Dataset: An overview of the standard time series dataset we will use.
  2. Test Setup: How we will evaluate forecast models in this tutorial.
  3. Persistence Forecast: The persistence forecast and how to automate it.
  4. Expanding Window Forecast: The expanding window forecast and how to automate it.
  5. Rolling Window Forecast: The rolling window forecast and how to automate it.

An up-to-date Python SciPy environment is used, including Python 2 or 3, Pandas, NumPy, Matplotlib, and scikit-learn.

Monthly Car Sales Dataset

In this tutorial, we will use the Monthly Car Sales dataset.

This dataset describes the number of car sales in Quebec, Canada between 1960 and 1968.

The units are a count of the number of sales and there are 108 observations. The source data is credited to Abraham and Ledolter (1983).

Download the dataset and save it into your current working directory with the filename "car-sales.csv". Note, you may need to delete the footer information from the file.

The code below loads the dataset as a Pandas Series object. (Note: recent versions of Pandas have removed Series.from_csv; in that case you can load the file with read_csv using header=0, index_col=0 and parse_dates=True, then take the single column as a Series.)

# line plot of time series
from pandas import Series
from matplotlib import pyplot
# load dataset
series = Series.from_csv('car-sales.csv', header=0)
# display first few rows
print(series.head(5))
# line plot of dataset
series.plot()
pyplot.show()

Running the example prints the first 5 rows of data.

Month
1960-01-01     6550
1960-02-01     8728
1960-03-01    12026
1960-04-01    14395
1960-05-01    14587
Name: Sales, dtype: int64

A line plot of the data is also provided.

Monthly Car Sales Dataset Line Plot

Experimental Test Setup

It is important to evaluate time series forecasting models consistently.

In this section, we will define how we will evaluate the three forecast models in this tutorial.

First, we will hold the last two years of data back and evaluate forecasts on this data. Given the data is monthly, this means that the last 24 observations will be used as test data.

We will use a walk-forward validation method to evaluate model performance. This means that each time step in the test dataset will be enumerated, a model constructed on history data, and the forecast compared to the expected value. The observation will then be added to the training dataset and the process repeated.

Walk-forward validation is a realistic way to evaluate time series forecast models as one would expect models to be updated as new observations are made available.

Finally, forecasts will be evaluated using root mean squared error or RMSE. The benefit of RMSE is that it penalizes large errors and the scores are in the same units as the forecast values (car sales per month).

In summary, the test harness involves:

  • The last 2 years of data used as the test set.
  • Walk-forward validation for model evaluation.
  • Root mean squared error used to report model skill.
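Pulling these pieces together, a minimal sketch of such a harness is shown below. The walk_forward_rmse function and its forecast_fn argument are illustrative names introduced here, not part of the tutorial's own code; each model in this tutorial can be expressed as a rule that maps the history of observations to the next prediction.

from math import sqrt
from sklearn.metrics import mean_squared_error

def walk_forward_rmse(train, test, forecast_fn):
    # forecast_fn maps the list of past observations (history) to the next prediction
    history = [x for x in train]
    predictions = list()
    for i in range(len(test)):
        yhat = forecast_fn(history)
        predictions.append(yhat)
        # add the true observation to the history before forecasting the next step
        history.append(test[i])
    return sqrt(mean_squared_error(test, predictions))

For example, walk_forward_rmse(train, test, lambda history: history[-1]) would score the naive persistence forecast described next.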

Optimized Persistence Forecast

The persistence forecast involves using the previous observation to predict the next time step.

For this reason, the approach is often called the naive forecast.
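In code, given a list named history that holds all observations seen so far, this is a single line (a minimal sketch):

# naive (t-1) persistence: the next forecast is simply the last observation
yhat = history[-1]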

Why stop with using the previous observation? In this section, we will look at automating the persistence forecast and evaluate the use of any arbitrary prior time step to predict the next time step.

We will explore using each of the prior 24 months of point observations in a persistence model. Each configuration will be evaluated using the test harness and RMSE scores collected. We will then display the scores and graph the relationship between the persisted time step and the model skill.

The complete example is listed below.

from pandas import Series
from sklearn.metrics import mean_squared_error
from math import sqrt
from matplotlib import pyplot
# load data
series = Series.from_csv('car-sales.csv', header=0)
# prepare data
X = series.values
train, test = X[0:-24], X[-24:]
persistence_values = range(1, 25)
scores = list()
for p in persistence_values:
    # walk-forward validation
    history = [x for x in train]
    predictions = list()
    for i in range(len(test)):
        # make prediction
        yhat = history[-p]
        predictions.append(yhat)
        # observation
        history.append(test[i])
    # report performance
    rmse = sqrt(mean_squared_error(test, predictions))
    scores.append(rmse)
    print('p=%d RMSE:%.3f' % (p, rmse))
# plot scores over persistence values
pyplot.plot(persistence_values, scores)
pyplot.show()

Running the example prints the RMSE for each persisted point observation.

p=1 RMSE:3947.200
p=2 RMSE:5485.353
p=3 RMSE:6346.176
p=4 RMSE:6474.553
p=5 RMSE:5756.543
p=6 RMSE:5756.076
p=7 RMSE:5958.665
p=8 RMSE:6543.266
p=9 RMSE:6450.839
p=10 RMSE:5595.971
p=11 RMSE:3806.482
p=12 RMSE:1997.732
p=13 RMSE:3968.987
p=14 RMSE:5210.866
p=15 RMSE:6299.040
p=16 RMSE:6144.881
p=17 RMSE:5349.691
p=18 RMSE:5534.784
p=19 RMSE:5655.016
p=20 RMSE:6746.872
p=21 RMSE:6784.611
p=22 RMSE:5642.737
p=23 RMSE:3692.062
p=24 RMSE:2119.103

A plot of the persisted value (t-n) to model skill (RMSE) is also created.

From the results, it is clear that persisting the observation from 12 months ago or 24 months ago is a great starting point on this dataset.

The best result achieved involved persisting the result from t-12 with an RMSE of 1997.732 car sales.

This is an obvious result, but also very useful.

We would expect that a forecast model that is some weighted combination of the observations at t-12, t-24, t-36 and so on would be a powerful starting point.

It also points out that the naive t-1 persistence would have been a less desirable starting point on this dataset.
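As a rough illustration of that idea, the sketch below averages the observations from 12 and 24 months ago. The seasonal_persistence name and the equal 0.5 weights are assumptions for illustration only, not tuned values from this tutorial.

# hypothetical seasonal persistence: weighted combination of the t-12 and t-24 observations
# (equal weights are an illustrative assumption, not a tuned result)
def seasonal_persistence(history, lags=(12, 24), weights=(0.5, 0.5)):
    return sum(w * history[-lag] for lag, w in zip(lags, weights))

# inside the walk-forward loop this would be used as: yhat = seasonal_persistence(history)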

Persisted Observation to RMSE on the Monthly Car Sales Dataset

We can use the t-12 model to make a prediction and plot it against the test data.

The complete example is listed below.

from pandas import Series
from sklearn.metrics import mean_squared_error
from math import sqrt
from matplotlib import pyplot
# load data
series = Series.from_csv('car-sales.csv', header=0)
# prepare data
X = series.values
train, test = X[0:-24], X[-24:]
# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
    # make prediction
    yhat = history[-12]
    predictions.append(yhat)
    # observation
    history.append(test[i])
# plot predictions vs observations
pyplot.plot(test)
pyplot.plot(predictions)
pyplot.show()

Running the example plots the test dataset (blue) against the predicted values (orange).

Line Plot of Predicted Values vs Test Dataset for the t-12 Persistence Model

You can learn more about the persistence model for time series forecasting in a dedicated post on baseline forecasts.

Expanding Window Forecast

An expanding window refers to a model that calculates a statistic on all available historic data and uses that to make a forecast.

It is an expanding window because it grows as more real observations are collected.

Two good starting point statistics to calculate are the mean and the median historical observation.

The example below uses the expanding window mean as the forecast.

from pandas import Series
from sklearn.metrics import mean_squared_error
from math import sqrt
from numpy import mean
# load data
series = Series.from_csv('car-sales.csv', header=0)
# prepare data
X = series.values
train, test = X[0:-24], X[-24:]
# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
    # make prediction
    yhat = mean(history)
    predictions.append(yhat)
    # observation
    history.append(test[i])
# report performance
rmse = sqrt(mean_squared_error(test, predictions))
print('RMSE: %.3f' % rmse)

Running the example prints the RMSE evaluation of the approach.

RMSE: 5113.067

We can also repeat the same experiment with the median of the historical observations. The complete example is listed below.

from pandas import Series
from sklearn.metrics import mean_squared_error
from math import sqrt
from numpy import median
# load data
series = Series.from_csv('car-sales.csv', header=0)
# prepare data
X = series.values
train, test = X[0:-24], X[-24:]
# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
    # make prediction
    yhat = median(history)
    predictions.append(yhat)
    # observation
    history.append(test[i])
# report performance
rmse = sqrt(mean_squared_error(test, predictions))
print('RMSE: %.3f' % rmse)

Again, running the example prints the skill of the model.

We can see that on this problem the historical mean produced a better result than the median.