ARMA, ARIMA, SARIMA#

../../_images/cover4.png

Overview#

  • In the previous four lessons, we learned about stationarity, smoothing, trend, seasonality, and autocorrelation, and built two different kinds of models:

MA models: The current value of the series depends linearly on the series’ mean and a set of prior (observed) white noise error terms.

AR models: The current value of the series depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term).
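In symbols, with \(\varepsilon_t\) denoting a white noise term:

MA(\(q\)): \(y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}\)

AR(\(p\)): \(y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t\)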

  • In this lesson we will review these concepts and combine the AR and MA models into three more complicated ones: ARMA, ARIMA, and SARIMA.

In particular, we will cover:

  1. Autoregressive Moving Average (ARMA) models.

  2. Autoregressive Integrated Moving Average (ARIMA) models.

  3. SARIMA models (ARIMA model for data with seasonality).

  4. Selecting the best model.

import sys

# Install dependencies if the notebook is running in Colab
if 'google.colab' in sys.modules:
    !pip install -U -qq tsa-course pmdarima
# Imports
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning, ValueWarning
warnings.simplefilter('ignore', ConvergenceWarning)
warnings.simplefilter('ignore', ValueWarning)
import time
from itertools import product
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels as ss
import seaborn as sns
from tqdm.notebook import tqdm
from scipy import stats
from scipy.fft import fft
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import month_plot, plot_acf, plot_pacf
from statsmodels.graphics.gofplots import qqplot
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.tsa.seasonal import STL, seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tools.eval_measures import mse
from statsmodels.tsa.statespace.tools import diff
import pmdarima as pm
from tsa_course.lecture1 import fft_analysis
np.random.seed(0)                

ARMA#

The ARMA model (also known as the Box-Jenkins approach) combines two models:

  • An autoregressive (AR) model of order \(p\).

  • A moving average (MA) model of order \(q\).

  • When there is autocorrelation between outcomes and their predecessors, there will be a pattern in the time series.

  • This relationship can be modeled using an ARMA model.

  • It allows us to predict the future with a confidence level proportional to the strength of the relationship and the proximity to known values (prediction weakens the further out we go).
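Combining the two components gives the ARMA(\(p, q\)) model:

\(y_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}\)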

📝 Note

  • ARMA models assume that the time series is stationary.

  • A good rule of thumb is to have at least 100 observations when fitting an ARMA model.

Load data#

  • In the following, we’ll look at the monthly average temperatures between 1907 and 1972.

# load data and convert to datetime
monthly_temp = pd.read_csv('https://zenodo.org/records/10951538/files/arima_temp.csv?download=1', 
                           skipfooter=2, 
                           header=0, 
                           index_col=0, 
                           names=['month', 'temp'],
                           engine='python')
monthly_temp.index = pd.to_datetime(monthly_temp.index)
  • This is what the data looks like.

monthly_temp.head()
temp
month
1907-01-01 33.3
1907-02-01 46.0
1907-03-01 43.0
1907-04-01 55.0
1907-05-01 51.8
  • Here are some summary statistics.

monthly_temp.describe()
temp
count 792.000000
mean 53.553662
std 15.815452
min 11.200000
25% 39.675000
50% 52.150000
75% 67.200000
max 82.400000
  • This is the run sequence plot.

monthly_temp['temp'].plot(grid=True, figsize=(14, 4), title="Monthly temperatures");
../../_images/b44598d0c0c917238a19167e109ba9a66aafdc9fa701f7d50e5023a08306f471.png
  • Compute the annual mean and plot it on top of the actual data.

# Compute annual mean 
annual_temp = monthly_temp.resample('YE').mean()  # year-end frequency ('A' is deprecated)
annual_temp.index.name = 'year'

plt.figure(figsize=(14, 4))
plt.plot(monthly_temp, label="Monthly Temperatures")
plt.plot(annual_temp, label="Annual Mean")
plt.grid(); plt.legend();
../../_images/6636811e066d8c1ddff0c50e1cafe8ee2f81ee2334f7c62bc6d2ccda24b86fe2.png
  • This gives us an indication that the mean is rather constant over the years.

  • We can extract further information about the underlying trend and seasonality by performing a seasonal decomposition.

  • We can use both the seasonal_decompose and the STL methods.

decomposition = seasonal_decompose(x=monthly_temp['temp'], model='additive', period=12)
seasonal, trend, resid = decomposition.seasonal, decomposition.trend, decomposition.resid

fig, axs = plt.subplots(2,2, sharex=True, figsize=(18,6))
axs[0,0].plot(monthly_temp['temp'])
axs[0,0].set_title('Original')
axs[0,1].plot(seasonal)
axs[0,1].set_title('Seasonal')
axs[1,0].plot(trend)
axs[1,0].set_title('Trend')
axs[1,1].plot(resid)
axs[1,1].set_title('Residual')
plt.tight_layout()
../../_images/8741c8f74b816c1d2029a34fec72dde88f5bf0a914a696c014d9887cc1251938.png
decomposition = STL(endog=monthly_temp['temp'], period=12, seasonal=13, robust=True).fit()
seasonal, trend, resid = decomposition.seasonal, decomposition.trend, decomposition.resid

fig, axs = plt.subplots(2,2, sharex=True, figsize=(18,6))
axs[0,0].plot(monthly_temp['temp'])
axs[0,0].set_title('Original')
axs[0,1].plot(seasonal)
axs[0,1].set_title('Seasonal')
axs[1,0].plot(trend)
axs[1,0].set_title('Trend')
axs[1,1].plot(resid)
axs[1,1].set_title('Residual')
plt.tight_layout()
../../_images/0845b84c08888fa44d7234e25c2754efcdf73047491add49131416d837928dd2.png
  • The seasonality is well defined.

  • There doesn’t seem to be a strong, time-varying trend in the data.

    • We can assume the trend is constant.


ARMA modeling stages#

There are three stages in building an ARMA model:

  1. Model identification.

  2. Model estimation.

  3. Model evaluation.

Model identification#

  • Model identification consists of finding the orders \(p\) and \(q\) of the AR and MA components.

  • Before performing model identification we need to:

    1. Determine if the time series is stationary.

    2. Determine if the time series has a seasonal component.

Determine stationarity#

  • We will use tools we already know (ADF test).

  • We can also look at the rolling mean and std.


⚠ Attention! ⚠

  • Before we continue, let’s consider the following result:

sinusoid = np.sin(np.arange(200))
_, pvalue, _, _, _, _ = adfuller(sinusoid)
print(f'p-value: {pvalue}')
p-value: 0.0
  • Periodic signals, by their nature, have means and variances that repeat over the period of the cycle.

  • This implies that their statistical properties are functions of time within each period.

  • For instance, the mean of a periodic signal over one cycle may be constant.

  • However, when considering any point in time relative to the cycle, the instantaneous mean of the signal can vary.

  • Similarly, the variance can fluctuate within the cycle.

  • The ADF test specifically looks for a unit root (more on this later on).

  • A unit root indicates that shocks to the time series have a permanent effect, causing drifts in the level of the series.

  • A sinusoidal function, by contrast, is inherently mean-reverting within its cycles.

  • After a peak a sinusoid reverts to its mean and any “shock” in terms of phase shift or amplitude change does not alter its oscillatory nature.

  • It’s crucial to note that when the ADF test labels a sinusoid as stationary, this does not mean the sinusoid is stationary in the strict sense.

  • The test’s conclusion is only about the absence of a unit root.

  • It does not imply that the mean and variance are constant within the periodic fluctuations.
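
  • For contrast, here is a minimal sketch of the opposite behavior: a random walk does contain a unit root, so the ADF test should fail to reject \(H_0\) and return a large p-value.

# A random walk y_t = y_{t-1} + e_t accumulates shocks permanently (unit root)
random_walk = np.cumsum(np.random.normal(size=200))
_, pvalue, _, _, _, _ = adfuller(random_walk)
print(f'p-value: {pvalue:.3f}')  # typically well above 0.05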


def adftest(series, plots=True):
    out = adfuller(series, autolag='AIC')
    print(f'ADF Statistic: {out[0]:.2f}')
    print(f'p-value: {out[1]:.3f}')
    print(f"Critical Values: {[f'{k}: {r:.2f}' for r,k in zip(out[4].values(), out[4].keys())]}\n")
    
    if plots:
        # Compute rolling statistics
        rolmean = series.rolling(window=12).mean()
        rolstd = series.rolling(window=12).std()

        # Plot rolling statistics:
        plt.figure(figsize=(14, 4))
        plt.plot(series, color='tab:blue',label='Original')
        plt.plot(rolmean, color='tab:red', label='Rolling Mean')
        plt.plot(rolstd, color='black', label = 'Rolling Std')
        plt.legend(loc='best')
        plt.title('Rolling Mean and Standard Deviation')
        plt.grid(); 
# run ADF on monthly temperatures
adftest(monthly_temp.temp)
ADF Statistic: -6.48
p-value: 0.000
Critical Values: ['1%: -3.44', '5%: -2.87', '10%: -2.57']
../../_images/784317ab4af05c72b96186eb0e9ddfd1bd1bcad7827e4228bfe83fda2c326a49.png
  • The p-value indicates that the time series is stationary…

  • … even if it clearly has a periodic component.

  • The rolling mean and rolling standard deviation seem globally constant along the time series…

  • … even if they change locally within the period.

Determine seasonality#

We can determine if seasonality is present by using the following tools:

  • Autocorrelation plot.

  • Seasonal subseries plot (month plot).

  • Fourier Transform.

# run ADF on annual means
adftest(annual_temp.temp, plots=False) # no point in plotting the rolling mean/std here
ADF Statistic: -7.88
p-value: 0.000
Critical Values: ['1%: -3.54', '5%: -2.91', '10%: -2.59']
# Generate synthetic time series data
dates = pd.date_range(start='2010-01-01', periods=60, freq='ME')  # Monthly data for 5 years ('M' is deprecated)
seas = 12 # change this and see how the plots change
data = np.sin(np.arange(60)*2*np.pi/seas) + np.random.normal(loc=0, scale=0.2, size=60)  # Seasonal data with noise
series = pd.Series(data, index=dates)
fig, axes = plt.subplots(1,3,figsize=(16,4))
series.plot(ax=axes[0], title="Original time series")

# ACF Plot
plot_acf(series, lags=36, ax=axes[1]);

# Convert series to a DataFrame and add a column for the month
df = series.to_frame(name='Value')
df['Month'] = df.index.month

# Seasonal Subseries Plot
month_plot(df['Value'], ax=axes[2]); axes[2].set_title("Seasonal Subseries Plot");
../../_images/004551c314790d38ea8f3a30f81a041d1d56f9cffe95fcd2c3c880ac4fa9c41f.png
  • Let’s look at the real data now.

_, ax = plt.subplots(1,1, figsize=(7,3))
month_plot(monthly_temp, ax=ax)
plt.tight_layout();
../../_images/829d1ed3f66d70d0a3d8ae4e86a28d78753a775e69bb4fbe4420e514da036a58.png
  • Notice that a violinplot can give very similar information to the month_plot.

_, ax = plt.subplots(1,1, figsize=(8,3))
sns.violinplot(x=monthly_temp.index.month, y=monthly_temp.temp, ax=ax) # notice the indexing on the x by month
plt.grid();
../../_images/af0019aa049ec4e0f6b40890da6cffad260ccb0fc66e06de76ccd35115ead759.png
  • Finally, to obtain the numerical value of the main periodicity we can use the Fourier Transform (more on this later).

  • Here, we’ll use the function we defined in the first lecture.

dominant_period, _, _ = fft_analysis(monthly_temp['temp'].values)
print(f"Dominant period: {np.round(dominant_period)}")
Dominant period: 12.0

Remove the main seasonality#

  • In this case, it is clear that the main seasonality is \(L=12\).

  • We can remove it with a seasonal differencing.
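
  • Seasonal differencing at lag \(L\) replaces each observation with its change from one season earlier: \(y'_t = y_t - y_{t-L}\), here with \(L = 12\).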

monthly_temp['Seasonally_Differenced'] = monthly_temp['temp'].diff(12)
# Drop nan
monthly_temp_clean = monthly_temp.dropna()
monthly_temp_clean 
temp Seasonally_Differenced
month
1908-01-01 35.6 2.3
1908-02-01 35.2 -10.8
1908-03-01 48.1 5.1
1908-04-01 50.0 -5.0
1908-05-01 46.8 -5.0
... ... ...
1972-08-01 75.6 -4.9
1972-09-01 64.1 -1.7
1972-10-01 51.7 0.6
1972-11-01 40.3 -1.5
1972-12-01 30.3 -0.3

780 rows × 2 columns

⚙ Try it yourself

  • Try redoing the previous plots on the differenced data!

Identifying \(p\) and \(q\)#

As we learned in the previous lesson, we will identify the AR order \(p\) and the MA order \(q\) with:

  • Autocorrelation function (ACF) plot.

  • Partial autocorrelation function (PACF) plot.

AR(\(p\))

  • The order of the AR model is identified as follows:

    • Plot 95% confidence interval on the PACF (done automatically by statsmodels).

    • Choose lag \(p\) such that the partial autocorrelation becomes insignificant for \(p+1\) and beyond.

  • If a process depends on previous values of itself then it is an AR process.

  • If it depends on previous errors, then it is an MA process.

  • An AR process propagates shocks infinitely.

  • AR processes will exhibit exponential decay in ACF and a cut-off in PACF.

_, ax = plt.subplots(1, 1, figsize=(5, 3))
plot_pacf(monthly_temp_clean['Seasonally_Differenced'], lags=20, ax=ax); 
../../_images/add2dd9af19adc3c09ee45ca693ab5f56da3487d7b31db64a094a89b50420b7a.png
  • It looks like the PACF becomes insignificant after lag 2.

  • However, there is a non-zero partial autocorrelation at lag 3.

  • The optimal value might be \(p=1\), \(p=2\), or \(p=3\).

  • Note that there are high partial autocorrelations at higher lags, especially 12.

    • This is an effect from seasonality and seasonal differencing.

    • It should not be taken into account when choosing \(p\).

MA(\(q\))

  • The order of the MA model is identified as follows:

    • Plot 95% confidence interval on the ACF (done automatically by statsmodels).

    • Choose lag \(q\) such that ACF becomes statistically zero for \(q+1\) and beyond.

  • MA models do not propagate shocks infinitely; they die after \(q\) lags.

  • MA processes will exhibit exponential decay in the PACF and a cut-off in the ACF.

_, ax = plt.subplots(1, 1, figsize=(5, 3))
plot_acf(monthly_temp_clean['Seasonally_Differenced'], lags=20, ax=ax); 
../../_images/c1fdfd60d27bfe68c8e9ebaa106599e5c1da9b69ac0ed6a211f2397c98efa271.png
  • Also in this case there are non-zero autocorrelations at lags 1 and 3.

  • So, the values to try are \(q=1\), \(q=2\), or \(q=3\).

Model estimation#

  • Once the orders \(p\) and \(q\) are identified, it is necessary to estimate the parameters \(\phi_1, \dots, \phi_p\) of the AR part and the parameters \(\theta_1, \dots, \theta_q\) of the MA part.

  • Estimating the parameters of an ARMA model is a complicated, nonlinear problem.

  • Nonlinear least squares and maximum likelihood estimation (MLE) are common approaches.

  • Many modern software programs will fit the ARMA model for us.

  • We split the data into two parts:

    • the training set, which will be used to fit the model’s parameters.

    • the test set, which will be used later on to evaluate the prediction performance of the model on unseen data.

train = monthly_temp_clean['Seasonally_Differenced'][:-36]
test = monthly_temp_clean['Seasonally_Differenced'][-36:]

plt.figure(figsize=(12,3))
plt.plot(train)
plt.plot(test);
../../_images/94354eec9dc810b913177b15b37289c39f32d70c91becc6611c837efb9aab505.png
model = ARIMA(train, order=(3, 0, 3))  # ARIMA with d=0 is equivalent to ARMA
fit_model = model.fit()

print(fit_model.summary())
                                 SARIMAX Results                                  
==================================================================================
Dep. Variable:     Seasonally_Differenced   No. Observations:                  744
Model:                     ARIMA(3, 0, 3)   Log Likelihood               -2248.119
Date:                    Tue, 07 May 2024   AIC                           4512.238
Time:                            16:24:06   BIC                           4549.134
Sample:                        01-01-1908   HQIC                          4526.460
                             - 12-01-1969                                         
Covariance Type:                      opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0644      0.245      0.263      0.793      -0.416       0.544
ar.L1         -0.0891      0.137     -0.648      0.517      -0.358       0.180
ar.L2         -0.8305      0.023    -35.754      0.000      -0.876      -0.785
ar.L3          0.0521      0.120      0.435      0.664      -0.183       0.287
ma.L1          0.2842      0.136      2.092      0.036       0.018       0.551
ma.L2          0.9966      0.012     84.858      0.000       0.974       1.020
ma.L3          0.2394      0.136      1.759      0.079      -0.027       0.506
sigma2        24.3189      1.008     24.134      0.000      22.344      26.294
===================================================================================
Ljung-Box (L1) (Q):                   0.03   Jarque-Bera (JB):                56.84
Prob(Q):                              0.86   Prob(JB):                         0.00
Heteroskedasticity (H):               0.77   Skew:                             0.12
Prob(H) (two-sided):                  0.04   Kurtosis:                         4.33
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

ARMA Model Validation#

  • How do you know if your ARMA model is any good?

  • We can check the residuals, i.e., what the model was not able to fit.

  • The residuals should resemble white noise: uncorrelated, with constant mean and variance, and approximately Gaussian.

  • Otherwise, we might need to select a better model.

residuals = fit_model.resid

plt.figure(figsize=(10,3))
plt.plot(residuals)
plt.title("Residuals");
../../_images/d259f4a0e2535ff5f662eb90b0ff808caaea4bb2f8ac0e2ac149b9059ff01c05.png

🤔 How to test if the residuals look like noise?

  • We will use both visual inspection and statistical tests.

  • Visual inspection:

    • ACF plot.

    • Histogram.

    • QQ plot.

  • Statistical tests:

    • Normality.

    • Autocorrelation.

    • Heteroskedasticity.

Visual inspection#

ACF plot

  • Checks for any autocorrelation in the residuals.

  • White noise should show no significant autocorrelation at any lag.

_, ax = plt.subplots(1, 1, figsize=(5, 3))
plot_acf(residuals, lags=10, ax=ax)
plt.title('ACF of Residuals')
plt.show()
../../_images/08153cde184a424eede91135351d90329ab21789eea903c4f9af3c90af7391bb.png

Histogram and QQ-Plot

  • Assess the normality of the residuals.

  • White noise should ideally follow a normal distribution.

# Histogram
plt.figure(figsize=(8,3))
plt.hist(residuals, bins=20, density=True, alpha=0.6, color='g')
# Add the normal distribution curve
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, np.mean(residuals), np.std(residuals))
plt.plot(x, p, 'k', linewidth=2)
title = "Fit Results: mu = %.2f,  std = %.2f" % (np.mean(residuals), np.std(residuals))
plt.title(title)
plt.show()
../../_images/0d73e4f891d0e926a0ee61a68a216131e18ba1b0b416e258d379ec664d4dce37.png
# QQ-Plot
_, ax = plt.subplots(1, 1, figsize=(6, 6))
qqplot(residuals, line='s', ax=ax)
plt.title('QQ-Plot of Residuals')
plt.show()
../../_images/3af8c40276e67ba6e5a5bbeab077b9e554d6be0d0e4c351582d1045a2c11a631.png
  • The plots are conveniently summarized in the function plot_diagnostics(), which can be called on the fitted model.

fit_model.plot_diagnostics(figsize=(10, 8))
plt.tight_layout();
../../_images/628e9db635bdcd1ede0d9329b2300545e9a75cb8d33fd161907da9422e09e687.png

Statistical tests#

Normality: Jarque-Bera and Shapiro-Wilk tests

\(H_0\): the residuals are normally distributed.

norm_val, norm_p, skew, kurtosis = fit_model.test_normality('jarquebera')[0]
print('Normality (Jarque-Bera) p-value:{:.3f}'.format(norm_p))
Normality (Jarque-Bera) p-value:0.000
shapiro_test = stats.shapiro(residuals)
print(f'Normality (Shapiro-Wilk) p-value: {shapiro_test.pvalue:.3f}')
Normality (Shapiro-Wilk) p-value: 0.000
  • The small p-values allow us to reject \(H_0\).

  • Conclusion: the residuals are not normally distributed.

📝 Note

  • For reference, let’s see what these tests say about data that are actually normally distributed.

  • Try executing the cell below multiple times and see how much the results change each time.

  • These tests start to be reliable only for large sample sizes (\(N>5000\)).

# generate random normal data
normal_data = np.random.normal(loc=0, scale=1, size=1000)

jb_test = stats.jarque_bera(normal_data)
print(f'Normality (Jarque-Bera) p-value: {jb_test.pvalue:.3f}')

shapiro_test = stats.shapiro(normal_data)
print(f'Normality (Shapiro-Wilk) p-value: {shapiro_test.pvalue:.3f}')
Normality (Jarque-Bera) p-value: 0.811
Normality (Shapiro-Wilk) p-value: 0.938

Autocorrelation: Ljung-Box test

\(H_0\): the residuals are independently distributed (no autocorrelation).

  • There is a \(p\)-value for each lag.

  • Here we just take the mean, but one might also want to look at the p-value at the largest lag (pval[-1]).

  • It is also not always obvious how many lags should be used in the test…
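
  • For reference, the test statistic over \(h\) lags is \(Q = n(n+2)\sum_{k=1}^{h} \frac{\hat{\rho}_k^2}{n-k}\), where \(\hat{\rho}_k\) is the residual autocorrelation at lag \(k\); under \(H_0\), \(Q\) approximately follows a \(\chi^2\) distribution.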

statistic, pval = fit_model.test_serial_correlation(method='ljungbox', lags=10)[0]
print(f'Ljung-Box p-value: {pval.mean():.3f}') 
Ljung-Box p-value: 0.343

Autocorrelation: Durbin Watson test

  • Tests autocorrelation in the residuals.

  • We want a value between 1 and 3.

  • 2 is ideal (no serial correlation).
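
  • For reference, the statistic is \(d = \frac{\sum_{t=2}^{T}(e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2} \approx 2(1 - \hat{\rho}_1)\), so \(d \approx 2\) when the lag-1 residual autocorrelation \(\hat{\rho}_1\) is close to zero.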

durbin_watson = ss.stats.stattools.durbin_watson(fit_model.filter_results.standardized_forecasts_error[0, fit_model.loglikelihood_burn:])
print('Durbin-Watson: d={:.2f}'.format(durbin_watson))
Durbin-Watson: d=2.01

Heteroskedasticity test

  • Tests for a change in the variance of the residuals.

\(H_0\): no heteroskedasticity.

  • \(H_0\) indicates different things based on the alternative \(H_A\):

    • \(H_A\): Increasing, \(H_0\): the variance is not increasing throughout the series.

    • \(H_A\): Decreasing, \(H_0\): the variance is not decreasing throughout the series.

    • \(H_A\): Two-sided (default), \(H_0\): the variance is not changing throughout the series.

_, pval = fit_model.test_heteroskedasticity('breakvar', alternative='increasing')[0]
print(f'H_a: Increasing - pvalue:{pval:.3f}')
H_a: Increasing - pvalue:0.980
_, pval = fit_model.test_heteroskedasticity('breakvar', alternative='decreasing')[0]
print(f'H_a: Decreasing - pvalue:{pval:.3f}')
H_a: Decreasing - pvalue:0.020
_, pval = fit_model.test_heteroskedasticity('breakvar', alternative='two-sided')[0]
print(f'H_a: Two-sided - pvalue:{pval:.3f}')
H_a: Two-sided - pvalue:0.040

Summary of our tests

Independence:

  • ✅ ACF plot.

  • ✅ Ljung-Box test.

  • ✅ Durbin Watson test.

Normality:

  • ✅ Histogram/Density plot.

  • 🤔 QQ-plot

  • ❌ Jarque-Bera (reliable for large sample size).

  • ❌ Shapiro-Wilk (reliable for large sample size).

Heteroskedasticity:

  • ❌ Heteroskedasticity test.

  • The tests are a bit inconclusive.

  • There is no strong evidence that the model is either very good or very bad.

  • It is probably wise to try other candidate models, e.g., ARIMA(2, 0, 2), and repeat the tests, as sketched below.
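
  • A minimal sketch of such a comparison, assuming the train series defined above: fit a few candidate orders and rank them by AIC (lower is better).

# Compare candidate ARMA(p, q) orders on the training set by AIC
candidates = []
for p, q in product([1, 2, 3], [1, 2, 3]):
    aic = ARIMA(train, order=(p, 0, q)).fit().aic
    candidates.append((p, q, aic))
# Show the three orders with the lowest AIC
print(sorted(candidates, key=lambda c: c[-1])[:3])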

ARMA Model Predictions#

  • Once the model is fit, we can use it to predict the test data.

  • The predictions come in the form of a distribution.

  • In other words, ARMA performs probabilistic forecasting.

  • The mean (and mode) of this distribution corresponds to the most likely value and is our point forecast.

  • The rest of the distribution can be used to compute confidence intervals.

pred_summary = fit_model.get_prediction(test.index[0], test.index[-1]).summary_frame()

plt.figure(figsize=(12, 4))
plt.plot(test.index, test, label='Ground Truth')
plt.plot(test.index, pred_summary['mean'], label='Forecast', linestyle='--')
plt.fill_between(test.index, pred_summary['mean_ci_lower'], pred_summary['mean_ci_upper'], 
                 color='orange', alpha=0.2, label='95% Confidence Interval')
plt.legend()
plt.tight_layout();
../../_images/2e9ef28c5e773d87fb131e0e7bd01efe9d5f0fde98e508ebf5a5e43c702c0e9d.png
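
  • Remember that these forecasts live on the seasonally differenced scale. A minimal sketch of mapping them back to temperatures, using the inverse identity \(y_t = y'_t + y_{t-12}\) (here with the observed lagged values, which is fine for evaluation but would have to be applied recursively in a true multi-step forecast):

# Undo the lag-12 seasonal difference: y_t = y'_t + y_{t-12}
lagged = monthly_temp_clean['temp'].shift(12).loc[test.index]
temp_forecast = pred_summary['mean'] + lagged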

ARIMA Model#

  • ARIMA stands for Auto Regressive Integrated Moving Average.

  • ARIMA models have three components:

    • AR model.

    • Integrated component (more on this shortly).

    • MA model.

  • The ARIMA model is denoted ARIMA(\(p, d, q\)).

    • \(p\) is the order of the AR model.

    • \(d\) is the number of times to difference the data.

    • \(q\) is the order of the MA model.

    • \(p\), \(d\), and \(q\) are nonnegative integers.

  • As we saw previously, differencing nonstationary time series data one or more times can make it stationary.

  • That’s the role of the integrated (I) component of ARIMA.

  • \(d\) is the number of times to perform a lag 1 difference on the data.

    • \(d=0\): no differencing.

    • \(d=1\): difference once.

    • \(d=2\): difference twice.
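
  • In backshift notation, where \(B y_t = y_{t-1}\), ARIMA(\(p, d, q\)) is an ARMA model applied to the \(d\)-times differenced series:

\((1 - \phi_1 B - \dots - \phi_p B^p)(1 - B)^d y_t = c + (1 + \theta_1 B + \dots + \theta_q B^q)\varepsilon_t\)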

  • The ARMA model is suitable for stationary time series where the mean and variance do not change over time.

  • The ARIMA model effectively models non-stationary time series by differencing the data.

  • In practice, ARIMA makes the time series stationary before applying the ARMA model.

  • Let’s see it with an example.

# Generate synthetic stationary data with an ARMA(1,1) process
n = 250
ar_coeff = np.array([1, -0.7]) # The first value refers to lag 0 and is always 1. In addition, AR coeff are negated.
ma_coeff = np.array([1, 0.7])  # The first value refers to lag 0 and is always 1
arma_data = ss.tsa.arima_process.ArmaProcess(ar_coeff, ma_coeff).generate_sample(nsample=n, burnin=1000)
# Generate a synthetic non-stationary data (needs to be differenced twice to be stationary)
t = np.arange(n)
non_stationary_data = 0.05 * t**2 + arma_data  # Quadratic trend

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
axes[0].plot(non_stationary_data)
axes[0].set_title('Original Data')
axes[1].plot(diff(non_stationary_data, k_diff=1))
axes[1].set_title('1st Differencing')
axes[2].plot(diff(non_stationary_data, k_diff=2))
axes[2].set_title('2nd Differencing')
plt.tight_layout();
../../_images/cc6ce031394a15edd832182bae77ae7134eb5645c1df0169d6441aa4fe4b68b6.png
# Fit models to stationary data
arma_model = ARIMA(arma_data[:-20], order=(1, 0, 1)).fit()
arima_model = ARIMA(arma_data[:-20], order=(1, 1, 1)).fit()

plt.figure(figsize=(12, 4))
plt.plot(arma_data[-50:], 'k', label='Stationary Data', linewidth=2)
plt.plot(arma_model.predict(200,250), label='ARMA Fit')
plt.plot(arima_model.predict(200, 250), label='ARIMA Fit')
plt.legend()
plt.title('Stationary Data')
plt.tight_layout();

print(len(arma_model.predict(10)))
220
../../_images/fef41086de32370a79df0b84ed64d33437b9e17d97bdd5bec8f838013bdc0652.png
# Fit models to non-stationary data
arma_model = ARIMA(non_stationary_data[:-20], order=(1, 0, 1)).fit()
arima_model = ARIMA(non_stationary_data[:-20], order=(1, 2, 1)).fit()

plt.figure(figsize=(12, 4))
plt.plot(non_stationary_data[-40:], 'k', label='Non-stationary Data', linewidth=2)
plt.plot(arma_model.predict(210,250), label='ARMA Fit')
plt.plot(arima_model.predict(210,250), label='ARIMA Fit')
plt.legend()
plt.title('Non-stationary Data')
plt.tight_layout();
/home/filippo/miniconda3/envs/sta2003_nb/lib/python3.10/site-packages/statsmodels/tsa/statespace/sarimax.py:966: UserWarning: Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.
  warn('Non-stationary starting autoregressive parameters'
../../_images/97902ee91dd9c414574132b3586d7863d16bfdaf17ca88567f05a0ad2fb578fd.png

SARIMA#

  • To apply ARMA and ARIMA, we must remove the seasonal component.

  • After computing the predictions, we had to add the seasonal component back.

  • It would be convenient to directly work on data with seasonality.

  • SARIMA is an extension of ARIMA that includes seasonal terms.

  • The model is specified as SARIMA \((p, d, q) \times (P, D, Q, s)\):

    • Regular ARIMA components \((p, d, q)\).

    • Seasonal components \((P, D, Q, s)\) where:

      • \(P\): Seasonal autoregressive order.

      • \(D\): Seasonal differencing order.

      • \(Q\): Seasonal moving average order.

      • \(s\): Number of time steps for a single seasonal period.
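
In the backshift notation introduced above, SARIMA applies a second set of AR, differencing, and MA operators at the seasonal lag \(s\):

\(\phi_p(B)\,\Phi_P(B^s)\,(1 - B)^d (1 - B^s)^D y_t = c + \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t\)

where \(\phi_p, \theta_q\) are the regular AR and MA polynomials and \(\Phi_P, \Theta_Q\) their seasonal counterparts.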

How to select the values \(s, P, D, Q\)?

  • \(s\):

    • Is the main seasonality in the data.

    • We already know how to find it.

  • \(P\) and \(Q\):

    • A spike at the \(s\)-th lag (and potentially at multiples of \(s\)) should be present in the ACF/PACF plots.

    • For example, if \(s = 12\), there could also be spikes at lags 24, 36, and so on.

    • Pick out the lags with the largest spikes as candidates for \(P\) or \(Q\).

  • \(D\):

    • Is the number of seasonal differences required to make the time series stationary.

    • Is often determined by trial and error or by examining the seasonally differenced data.

💡 Rule of thumb

  • Before selecting \(P\) and \(Q\), ensure that the series is seasonally stationary by applying seasonal differencing if needed (\(D\)).

  • Look at the ACF plot to identify the seasonal moving average order \(Q\).

    • Look for significant autocorrelations at seasonal lags (multiples of \(s\)).

    • If the ACF plot shows a slow decay in the seasonal lags, it suggests a need for a seasonal MA component (\(Q\)).

  • Look at the PACF plot to identify the seasonal autoregressive order \(P\).

    • Look for significant spikes at multiples of the seasonality \(s\).

    • A sharp cut-off in the PACF at a seasonal lag suggests the number of AR terms (\(P\)) needed.

diff_ts = monthly_temp['temp'].diff(periods=12).dropna()

fig = plt.figure(figsize=(10, 6))
ax1 = plt.subplot2grid((2, 2), (0, 0), colspan=2)
ax1.plot(diff_ts)
ax2 = plt.subplot2grid((2, 2), (1, 0))
plot_pacf(diff_ts, lags=80, ax=ax2)
ax3 = plt.subplot2grid((2, 2), (1, 1))
plot_acf(diff_ts, lags=80, ax=ax3)
plt.tight_layout();
../../_images/6dcd4f3b7006c8a6343e7c2a3cc4b835d266b826673c6cf5b7303592cf2b55e4.png
# fit a SARIMA model to the monthly data, with orders guided by the plots above
sar = ss.tsa.statespace.sarimax.SARIMAX(monthly_temp[:750].temp, 
                                order=(2,1,2), 
                                seasonal_order=(0,1,1,12), 
                                trend='c').fit(disp=False)
sar.summary()
SARIMAX Results
Dep. Variable: temp No. Observations: 750
Model: SARIMAX(2, 1, 2)x(0, 1, [1], 12) Log Likelihood -2026.063
Date: Tue, 07 May 2024 AIC 4066.127
Time: 16:24:10 BIC 4098.345
Sample: 01-01-1907 HQIC 4078.551
- 06-01-1969
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept -6.67e-05 0.000 -0.374 0.709 -0.000 0.000
ar.L1 -0.7658 0.170 -4.496 0.000 -1.100 -0.432
ar.L2 0.1790 0.042 4.262 0.000 0.097 0.261
ma.L1 -0.0543 0.182 -0.299 0.765 -0.411 0.302
ma.L2 -0.9433 0.173 -5.447 0.000 -1.283 -0.604
ma.S.L12 -0.9816 0.039 -24.909 0.000 -1.059 -0.904
sigma2 13.3628 0.955 13.993 0.000 11.491 15.234
Ljung-Box (L1) (Q): 0.03 Jarque-Bera (JB): 202.05
Prob(Q): 0.86 Prob(JB): 0.00
Heteroskedasticity (H): 0.77 Skew: -0.55
Prob(H) (two-sided): 0.04 Kurtosis: 5.32


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
sar.plot_diagnostics(figsize=(14, 8))
plt.tight_layout();
../../_images/5d27bd59df152c5259668059edbd9c146219826a050b8ce6e4bd8c5ed605c169.png
monthly_temp['forecast'] = sar.predict(start = 750, end= 792, dynamic=False)  
monthly_temp[730:][['temp', 'forecast']].plot(figsize=(12, 3))
plt.tight_layout();
print(f"MSE: {mse(monthly_temp['temp'][-42:],monthly_temp['forecast'][-42:]):.2f}")
MSE: 9.21
../../_images/f41988ea104c852f8ec21eb03ee9f06c618327260c0e74dabea542b4350efa14.png

AutoARIMA#

  • At this point it should be clear that identifying the optimal SARIMA model is difficult.

  • It requires careful analysis, trial and error, and some experience.

  • The following cheatsheet summarizes some rules of thumb to select the model.

💡 Cheatsheet: setting the coefficients

| ACF Shape | Indicated Model |
| --- | --- |
| Exponential, decaying to zero | AR model. Use the PACF to identify the order of the AR model. |
| Alternating positive and negative, decaying to zero | AR model. Use the PACF to identify the order. |
| One or more spikes, rest are essentially zero | MA model, order identified by where the plot becomes zero. |
| Decay starting after a few lags | Mixed AR and MA (ARMA) model. |
| All zero or close to zero | Data are essentially random. |
| High values at fixed intervals | Include a seasonal AR term. |
| No decay to zero | Series is not stationary. |

  • An alternative to manual model selection is to use automated procedures.

  • This is where AutoARIMA comes in.

  • AutoARIMA requires you to specify the maximum range of values to try.

  • Afterwards, it tries to find the best configuration among the possible ones.

  • See here for a complete list and description of the options available.

# Split the data into train and test sets
train, test = monthly_temp[:750].temp, monthly_temp[750:].temp

# Use auto_arima to find the best ARIMA model
model = pm.auto_arima(train, 
                      start_p=0, start_q=0,
                      test='adf',       # Use adftest to find optimal 'd'
                      max_p=2, max_q=2, # Maximum p and q
                      m=12,             # Seasonality
                      start_P=0, start_Q=0,
                      max_P=2, max_Q=2, # Maximum P and Q
                      seasonal=True,    # Seasonal ARIMA
                      d=None,           # Let model determine 'd'
                      D=1,              # Seasonal difference
                      trace=True,       # Print status on the fits
                      error_action='ignore',  
                      suppress_warnings=True, 
                      stepwise=True)    # Stepwise search to find the best model
Performing stepwise search to minimize aic
 ARIMA(0,0,0)(0,1,0)[12] intercept   : AIC=4558.569, Time=0.03 sec
 ARIMA(1,0,0)(1,1,0)[12] intercept   : AIC=4318.918, Time=0.26 sec
 ARIMA(0,0,1)(0,1,1)[12] intercept   : AIC=inf, Time=0.76 sec
 ARIMA(0,0,0)(0,1,0)[12]             : AIC=4556.588, Time=0.04 sec
 ARIMA(1,0,0)(0,1,0)[12] intercept   : AIC=4531.290, Time=0.07 sec
 ARIMA(1,0,0)(2,1,0)[12] intercept   : AIC=4230.055, Time=1.02 sec
 ARIMA(1,0,0)(2,1,1)[12] intercept   : AIC=inf, Time=2.09 sec
 ARIMA(1,0,0)(1,1,1)[12] intercept   : AIC=inf, Time=0.65 sec
 ARIMA(0,0,0)(2,1,0)[12] intercept   : AIC=4256.241, Time=0.70 sec
 ARIMA(2,0,0)(2,1,0)[12] intercept   : AIC=4230.500, Time=1.35 sec
 ARIMA(1,0,1)(2,1,0)[12] intercept   : AIC=4228.386, Time=1.46 sec
 ARIMA(1,0,1)(1,1,0)[12] intercept   : AIC=4317.592, Time=0.40 sec
 ARIMA(1,0,1)(2,1,1)[12] intercept   : AIC=inf, Time=3.09 sec
 ARIMA(1,0,1)(1,1,1)[12] intercept   : AIC=inf, Time=1.21 sec
 ARIMA(0,0,1)(2,1,0)[12] intercept   : AIC=4232.657, Time=0.83 sec
 ARIMA(2,0,1)(2,1,0)[12] intercept   : AIC=4230.040, Time=2.42 sec
 ARIMA(1,0,2)(2,1,0)[12] intercept   : AIC=4229.928, Time=1.42 sec
 ARIMA(0,0,2)(2,1,0)[12] intercept   : AIC=4232.612, Time=1.06 sec
 ARIMA(2,0,2)(2,1,0)[12] intercept   : AIC=4231.411, Time=2.53 sec
 ARIMA(1,0,1)(2,1,0)[12]             : AIC=4226.467, Time=0.49 sec
 ARIMA(1,0,1)(1,1,0)[12]             : AIC=4315.652, Time=0.19 sec
 ARIMA(1,0,1)(2,1,1)[12]             : AIC=inf, Time=2.39 sec
 ARIMA(1,0,1)(1,1,1)[12]             : AIC=inf, Time=0.71 sec
 ARIMA(0,0,1)(2,1,0)[12]             : AIC=4230.780, Time=0.22 sec
 ARIMA(1,0,0)(2,1,0)[12]             : AIC=4228.164, Time=0.26 sec
 ARIMA(2,0,1)(2,1,0)[12]             : AIC=4228.121, Time=0.79 sec
 ARIMA(1,0,2)(2,1,0)[12]             : AIC=4228.009, Time=0.38 sec
 ARIMA(0,0,0)(2,1,0)[12]             : AIC=4254.407, Time=0.26 sec
 ARIMA(0,0,2)(2,1,0)[12]             : AIC=4230.723, Time=0.41 sec
 ARIMA(2,0,0)(2,1,0)[12]             : AIC=4228.598, Time=0.39 sec
 ARIMA(2,0,2)(2,1,0)[12]             : AIC=4229.492, Time=0.90 sec

Best model:  ARIMA(1,0,1)(2,1,0)[12]          
Total fit time: 28.786 seconds
# Summarize the model
print(model.summary())

# Forecast future values
monthly_temp['forecast'] = model.predict(n_periods=len(test)) 
monthly_temp[730:][['temp', 'forecast']].plot(figsize=(12, 3))
plt.tight_layout();
print(f"MSE: {mse(monthly_temp['temp'][-42:],monthly_temp['forecast'][-42:]):.2f}")
                                      SARIMAX Results                                      
===========================================================================================
Dep. Variable:                                   y   No. Observations:                  750
Model:             SARIMAX(1, 0, 1)x(2, 1, [], 12)   Log Likelihood               -2108.234
Date:                             Tue, 07 May 2024   AIC                           4226.467
Time:                                     16:24:39   BIC                           4249.487
Sample:                                 01-01-1907   HQIC                          4235.344
                                      - 06-01-1969                                         
Covariance Type:                               opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.5814      0.144      4.027      0.000       0.298       0.864
ma.L1         -0.4101      0.162     -2.528      0.011      -0.728      -0.092
ar.S.L12      -0.6818      0.029    -23.405      0.000      -0.739      -0.625
ar.S.L24      -0.3479      0.030    -11.737      0.000      -0.406      -0.290
sigma2        17.5740      0.668     26.315      0.000      16.265      18.883
===================================================================================
Ljung-Box (L1) (Q):                   0.03   Jarque-Bera (JB):               145.56
Prob(Q):                              0.87   Prob(JB):                         0.00
Heteroskedasticity (H):               0.76   Skew:                            -0.39
Prob(H) (two-sided):                  0.04   Kurtosis:                         5.03
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
MSE: 11.12
../../_images/af8bd604eee9692d07aed16c173df60b5b6eb79f4a7dc305df7543662e54d258.png

AutoARIMA or not AutoARIMA?#

While being very convenient, like all automated procedures, auto_arima comes with drawbacks.

  1. auto_arima can be computationally expensive, especially for large datasets and when exploring a wide range of models.

  2. Automated model selection lacks the qualitative insights a human might bring to the modeling process.

    • These include understanding business cycles, external factors, or anomalies in the data.

  3. The defaults in auto_arima may not be optimal for all time series data.

    • The range of values to explore should be adjusted properly each time.

    • This might become almost as tricky as doing manual model selection.

  4. auto_arima requires a sufficiently long time series to accurately identify patterns and seasonality.

  5. The selection of the best model is typically based on statistical criteria such as AIC or BIC.

    • These might not always align with practical performance metrics such as MSE.


Summary#

  1. Autoregressive Moving Average (ARMA) models.

  2. Autoregressive Integrated Moving Average (ARIMA) models.

  3. SARIMA models (ARIMA model for data with seasonality).

  4. Selecting the best model.


Exercise#

  • Look at sensor data that tracks atmospheric CO2 from continuous air samples at Mauna Loa Observatory in Hawaii. This dataset includes monthly CO2 samples from JAN 1965 to DEC 1980.

co2 = pd.read_csv('https://zenodo.org/records/10951538/files/arima_co2.csv?download=1', 
                  header = 0,
                  names = ['idx', 'co2'],
                  skipfooter = 2,
                  engine = 'python')  # skipfooter requires the python engine

# convert the column idx into a datetime object and set it as the index
co2['idx'] = pd.to_datetime(co2['idx'])
co2.set_index('idx', inplace=True)

# Remove the name "idx" from the index column
co2.index.name = None
co2
co2
1965-01-01 319.32
1965-02-01 320.36
1965-03-01 320.82
1965-04-01 322.06
1965-05-01 322.17
... ...
1980-08-01 337.19
1980-09-01 335.49
1980-10-01 336.63
1980-11-01 337.74
1980-12-01 338.36

192 rows × 1 columns

  • Determine the presence of main trend and seasonality in the data.

  • Determine if the data are stationary.

  • Split the data into train (90%) and test (10%).

  • Find a set of SARIMAX candidate models by looking at the ACF and PACF.

  • Perform a grid search on the model candidates.

  • Select the best models, based on performance metrics, model complexity, and normality of the residuals.

  • Compare the best model you found with the one from autoarima.