How neural networks represent the past.
NASA Landsat Science, "Chaco Region Paraguay Time Series", 2025.
NDVI, floods, and electricity demand.
Normalized Difference Vegetation Index (NDVI) quantifies vegetation by measuring the difference between near-infrared (which vegetation strongly reflects) and red light (which vegetation absorbs).
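In symbols, writing NIR and Red for the reflectances in the two bands: \(\mathrm{NDVI} = (\mathrm{NIR}-\mathrm{Red})/(\mathrm{NIR}+\mathrm{Red})\), which ranges from \(-1\) to \(1\).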
NDVI time series reveal vegetation dynamics that single images miss.
NDVI change over South Africa within a single year.
NDVI context adapted from White, Introduction to Spatial Data in R, 2022.
Before: March 23, 2015
After: May 26, 2015
NASA Earth Observatory images by Joshua Stevens, using Landsat data from the U.S. Geological Survey.
Load profile on a backbone of the electricity network of Rome. The temperature-load relation is nonlinear; copying yesterday is the first baseline to beat.
Bianchi et al., "Recurrent neural networks for short-term load forecasting: an overview and comparative analysis", 2017.
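A minimal sketch of that baseline (synthetic hourly series and names are my own here, not the setup of Bianchi et al.): a seasonal-persistence forecaster simply copies the value observed one day earlier.

```python
import numpy as np

# Hypothetical hourly load: 30 days of a daily cycle plus noise.
rng = np.random.default_rng(0)
hours = np.arange(24 * 30)
load = 100 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)

# Persistence baseline: predict each hour with the value 24 hours earlier.
season = 24
y_true = load[season:]
y_pred = load[:-season]

mae = np.abs(y_true - y_pred).mean()
print(f"Persistence MAE: {mae:.2f}")  # any learned model should beat this
```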
Definitions and approach families.
The signal is often not in one observation, but in how observations change.
Past context is useful only relative to a prediction horizon.
Local models: tailored, data-inefficient.
\(\widehat{X}^{\,i}_{t:t+H}=\mathcal{F}_{\theta^i}(X^{\,i}_{t-W:t})\)
Global models: shared, scalable.
\(\widehat{X}^{\,i}_{t:t+H}=\mathcal{F}_{\theta}(X^{\,i}_{t-W:t}, a^{\,i})\)
How should the model represent the past?
Cini et al., "Graph Deep Learning for Spatiotemporal Time Series", 2023.
Sliding window and MLP.
Bianchi, Time Series Analysis with Python, online handbook, 2024.
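A minimal sketch of the window-then-regress recipe (synthetic series and scikit-learn's MLPRegressor are my choices here, not the handbook's): each training row holds the last \(W\) observations, and the target sits \(H\) steps ahead, instantiating the windowed predictor \(\mathcal{F}_{\theta}\) above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(1000)
x = np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=t.size)

W, H = 24, 1  # window length and forecast horizon
# Stack sliding windows: each row holds W lags, the target is H steps ahead.
X = np.stack([x[i : i + W] for i in range(x.size - W - H + 1)])
y = x[W + H - 1 :]

# Fit on the first 80% and evaluate on the rest (time-ordered split).
split = int(0.8 * len(X))
model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
model.fit(X[:split], y[:split])
print("test MAE:", np.abs(model.predict(X[split:]) - y[split:]).mean())
```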
Can the model learn its own memory?
RNNs, LSTMs, and Reservoir Computing.
A single update rule is shared across time. The hidden state \(h_t\) is the compact memory passed to the next step.
Bianchi, Time Series Analysis with Python, online handbook, 2024.
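As a minimal sketch, the shared update rule fits in a few lines of numpy (dimensions and random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 8
W_x = rng.normal(0, 0.3, (d_hidden, d_in))      # input-to-hidden weights
W_h = rng.normal(0, 0.3, (d_hidden, d_hidden))  # hidden-to-hidden weights
b = np.zeros(d_hidden)

def rnn_step(h_prev, x_t):
    """One shared update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# The same rule is applied at every step; h_t summarizes the whole prefix.
h = np.zeros(d_hidden)
for x_t in rng.normal(size=(20, d_in)):  # a toy sequence of 20 observations
    h = rnn_step(h, x_t)
print(h.round(3))
```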
Vanishing and exploding gradients make the training of RNNs challenging.
Forget gate: decide which parts of the previous memory survive.
Input gate: choose how much new candidate content to write.
Output gate: combine retained and written memory, then expose useful state.
Hochreiter and Schmidhuber, "Long short-term memory", 1997.
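A numpy sketch of one LSTM step matching the three gate captions above (illustrative dimensions, random weights, biases omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 3, 8
# One weight matrix per gate, each acting on the concatenation [h_{t-1}; x_t].
Wf, Wi, Wo, Wg = (rng.normal(0, 0.3, (d_h, d_h + d_in)) for _ in range(4))

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z)      # forget gate: which parts of c_{t-1} survive
    i = sigmoid(Wi @ z)      # input gate: how much new content to write
    g = np.tanh(Wg @ z)      # candidate content
    c = f * c_prev + i * g   # combine retained and written memory
    o = sigmoid(Wo @ z)      # output gate: expose useful state
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(20, d_in)):
    h, c = lstm_step(h, c, x_t)
```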
Useful when training a standard RNN is too expensive or when data are scarce.
Bianchi et al., "Reservoir computing approaches for representation and classification of multivariate time series", 2020.
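A minimal echo state network sketch in that spirit (reservoir size, spectral radius 0.9, and the ridge penalty are illustrative choices, not values from the paper): the recurrent weights stay fixed and random, and only the linear readout is trained.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_res, d_in = 200, 1

# Fixed random reservoir, rescaled to spectral radius < 1 (echo state property).
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-0.5, 0.5, (n_res, d_in))

def reservoir_states(x):
    """Run the untrained reservoir and collect its states."""
    h = np.zeros(n_res)
    states = []
    for x_t in x:
        h = np.tanh(W @ h + W_in @ np.atleast_1d(x_t))
        states.append(h.copy())
    return np.array(states)

# One-step-ahead forecasting: only this linear readout is fitted.
x = np.sin(np.arange(500) * 0.1)
S = reservoir_states(x[:-1])
readout = Ridge(alpha=1e-2).fit(S, x[1:])
print("train MAE:", np.abs(readout.predict(S) - x[1:]).mean())
```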
1-D filters, receptive fields, and causal convolutions.
A small kernel scans space and reuses the same detector at every location.
The same learned kernel scans time to detect slopes, peaks, pulses, or changes.
Weight sharing: the same learned kernel is applied at every time step.
A TCN grows memory by composing small filters across layers.
The receptive field is the maximum history that can affect one output.
For forecasting, every output must ignore observations from the future.
Preprocessing, normalization, and imputation must also respect time: fit their statistics on past data only, to avoid leakage.
Van Den Oord et al., "Wavenet: A generative model for raw audio", 2016.
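A sketch of a causal dilated 1-D convolution stack in the WaveNet spirit (kernel values and dilation schedule are arbitrary): stacking small filters with doubling dilations grows the receptive field geometrically with depth.

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """y[t] depends only on x[t], x[t-d], x[t-2d], ...: no future leakage."""
    k = len(kernel)
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for j in range(k):
            idx = t - j * dilation
            if idx >= 0:
                y[t] += kernel[j] * x[idx]
    return y

x = np.random.default_rng(0).normal(size=64)
kernel = np.array([0.5, 0.3, 0.2])  # k = 3 taps
y = x
for dilation in (1, 2, 4, 8):       # one conv per layer, dilations doubling
    y = causal_dilated_conv(y, kernel, dilation)

# Receptive field of the stack: 1 + (k - 1) * sum of the dilations.
k = len(kernel)
print("receptive field:", 1 + (k - 1) * sum((1, 2, 4, 8)))  # 31 steps
```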
A temporal CNN learns a bank of filters that match temporal patterns.
Good fit when local temporal shapes are informative and transferable across locations.
Attention, tokens, and validity.
Each token asks which other observations are relevant.
Attention selects observations by content; time encodings tell it when they occurred.
Vaswani et al., "Attention Is All You Need", 2017.
A temporal Transformer needs more than the raw observations: it also needs to know when they happened and whether they should be used.
The model ignores unavailable dates and can focus attention on the most informative ones.
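A sketch of both ingredients with PyTorch's stock attention layer (feature values are random; the sinusoidal time encoding is one common choice, not the only one): a key padding mask marks unavailable dates, and a time encoding added to each token says when it was observed.

```python
import torch
import torch.nn as nn

def time_encoding(timestamps, d_model=32):
    """Sinusoidal encoding of (possibly irregular) observation times."""
    freqs = 1.0 / (10000 ** (torch.arange(0, d_model, 2) / d_model))
    angles = timestamps[..., None] * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

d_model, n_obs = 32, 10
t_obs = torch.tensor([0.0, 1.0, 2.5, 3.0, 7.0, 7.5, 8.0, 9.0, 0.0, 0.0])
valid = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=torch.bool)

values = torch.randn(1, n_obs, d_model)              # observation embeddings
tokens = values + time_encoding(t_obs).unsqueeze(0)  # tell the model *when*

attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
# key_padding_mask: True = unavailable date, excluded from attention.
out, weights = attn(tokens, tokens, tokens, key_padding_mask=~valid.unsqueeze(0))
print(weights.shape)  # (1, 10, 10); masked columns receive zero weight
```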
Continuous-time models, S4, and Mamba.
Instead of storing a finite window, model a hidden state that evolves in continuous time.
Continuous dynamics evolve across irregular gaps; updates happen at observation times.
Chen et al., "Neural Ordinary Differential Equations", 2018; Rubanova et al., "Latent ODEs for Irregularly-Sampled Time Series", 2019.
Observe the continuous system only at times \(t_i\).
The elapsed time \(\Delta t_i = t_i - t_{i-1}\) determines how far the state evolves.
After discretization, the sequence is processed by scanning \(s_0 \rightarrow s_1 \rightarrow \cdots\).
\(A,B,C,D\) define state dynamics, input injection, readout, and skip connection; the layer learns these dynamics.
Gu et al., "Combining recurrent, convolutional, and continuous-time models with linear state space layers", 2021.
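A numpy sketch of the discretize-then-scan recipe for a diagonal \(A\) (diagonal state matrices are a common simplification, as in the diagonal variants of S4; all values here are illustrative). A per-step \(\Delta t_i\) would simply make \(\bar{A}\) and \(\bar{B}\) step-dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state = 4
A = -np.abs(rng.uniform(0.1, 1.0, d_state))  # diagonal, stable (negative)
B = rng.normal(size=d_state)
C = rng.normal(size=d_state)
D = 0.5

def ssm_scan(x, dt):
    """Zero-order-hold discretization, then a left-to-right scan."""
    A_bar = np.exp(A * dt)          # elementwise exp: A is diagonal
    B_bar = (A_bar - 1.0) / A * B   # exact ZOH input matrix for diagonal A
    s = np.zeros(d_state)
    y = []
    for x_k in x:
        s = A_bar * s + B_bar * x_k  # state update
        y.append(C @ s + D * x_k)    # readout plus skip connection
    return np.array(y)

x = np.sin(np.arange(100) * 0.2)
print(ssm_scan(x, dt=1.0)[:5].round(3))
```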
All state-space models carry a compact hidden state. S4 makes this efficient for long sequences; Mamba makes it selective.
S4: Gu, Goel, and Ré, 2021. Mamba: Gu and Dao, 2023.
Takeaways and references
| Architecture | Memory mechanism | Parallelism | Good fit | Watch out |
|---|---|---|---|---|
| Window + MLP | Fixed explicit lags | High | Baselines; engineered features | Choosing window size \(W\) |
| RNN/GRU/LSTM | Learned recurrent state | Low across time | Streaming; latent stages | Hard to train on long sequences |
| Reservoir/ESN | Random dynamical features | Medium | Limited labels; fast readouts | Sensitive to hyperparameter tuning |
| TCN | Receptive field by depth/dilation | High | Recurring patterns; local motifs | Context fixed by design |
| Transformer | Direct attention over observed dates | High | Irregular and missing data | Quadratic memory in sequence length |
| SSM/Mamba | State-space scan with selective memory | High / streaming | Long sequences; efficient context | Newer ecosystem |