Machine Learning seminars

Deep Learning for Temporal Data

How neural networks represent the past.

NASA Landsat Science, "Chaco Region Paraguay Time Series", 2025.

01
Time Turns Observations into Signals

Introductory examples

NDVI, floods, and electricity demand.

Earth observation example

NDVI time series

Normalized Difference Vegetation Index (NDVI) quantifies vegetation by measuring the difference between near-infrared (which vegetation strongly reflects) and red light (which vegetation absorbs).
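A minimal sketch of the computation just described, assuming near-infrared and red reflectance arrays (band values are illustrative):

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """NDVI = (NIR - Red) / (NIR + Red)."""
    nir, red = np.asarray(nir, dtype=float), np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)   # eps guards against division by zero

# Illustrative reflectances: dense vegetation vs. bare soil
print(ndvi([0.45, 0.30], [0.05, 0.25]))      # ~[0.80, 0.09]
```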

Phenology: green-up, peak season, senescence
Productivity: vegetation vigor and seasonal amplitude
Agriculture: crop calendars, management, illegal ploughing
Disturbance: deforestation, wildfires, abrupt vegetation loss
NDVI time series from Sentinel-2 images

NDVI time series reveal vegetation dynamics that single images miss.

Animated MODIS NDVI change over South Africa across a single year.

NDVI context adapted from White, Introduction to Spatial Data in R, 2022.

Change and events

Flood mapping is a temporal task

Landsat images of the Red River before and after flooding.

Before: March 23, 2015

After: May 26, 2015

NASA Earth Observatory images by Joshua Stevens, using Landsat data from the U.S. Geological Survey.

When should we use deep learning?

Nonlinear relationships, simple baselines

Electricity load profile measured on a backbone of the power network of Rome. The temperature-load relation is nonlinear; copying yesterday's profile is the first baseline to beat.

Temperature-load scatter and previous-day electricity baseline

Bianchi et al., "Recurrent neural networks for short-term load forecasting: an overview and comparative analysis", 2017.
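A sketch of the copy-yesterday baseline mentioned above, assuming a regularly sampled load series (the 15-minute resolution and random data are illustrative):

```python
import numpy as np

steps_per_day = 96                                    # e.g. 15-minute sampling
load = np.random.rand(7 * steps_per_day)              # toy stand-in for a week of load data

pred = load[:-steps_per_day]                          # forecast: copy the value one day earlier
mae = np.mean(np.abs(load[steps_per_day:] - pred))    # error any learned model should beat
```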

02
What Should a Model Remember?

Deep learning for temporal data

Definitions and approach families.

Main idea

Temporal data is about usable memory

The signal is often not in one observation, but in how observations change.

  • Future: what forecast horizon?
  • Past: how far back should we look?
  • Missingness: gaps, irregular samples
  • Leakage: keep future information hidden
Multi-step time-series forecasting schematic

Past context is useful only relative to a prediction horizon.

Formal setup

Window and horizon

\[ \widehat{X}_{t:t+H}=\mathcal{F}_{\theta}\left(X_{t-W:t},\, m_{t-W:t},\, u_{t-W:t+H},\, a\right) \]
Input window \(X_{t-W:t}\): the last \(W\) steps
Prediction horizon \(\widehat{X}_{t:t+H}\): the next \(H\) steps
\(X_{t-W:t}\): observed window
\(m_{t-W:t}\): validity mask
\(u_{t-W:t+H}\): known covariates
\(a\): static metadata
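A minimal NumPy sketch of slicing (window, horizon) training pairs under this setup; the sizes \(W=24\), \(H=6\) and the array shapes are illustrative:

```python
import numpy as np

def make_windows(x, W, H):
    """Slice a (T, C) series into inputs X_{t-W:t} and targets X_{t:t+H}."""
    idx = range(W, len(x) - H + 1)
    X = np.stack([x[t - W:t] for t in idx])    # (N, W, C) input windows
    Y = np.stack([x[t:t + H] for t in idx])    # (N, H, C) target horizons
    return X, Y

x = np.random.rand(500, 1)                      # toy univariate series
X, Y = make_windows(x, W=24, H=6)               # 24-step window, 6-step horizon
```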
Learning across collections

Local and global predictors

Local models

Local model schematic

Tailored, data inefficient.

\(\widehat{X}^{\,i}_{t:t+H}=\mathcal{F}_{\theta^i}(X^{\,i}_{t-W:t})\)

Global models

Global model schematic

Shared, scalable.

\(\widehat{X}^{\,i}_{t:t+H}=\mathcal{F}_{\theta}(X^{\,i}_{t-W:t}, a^{\,i})\)

How should the model represent the past?

  • fixed lags or windows
  • recurrent hidden state
  • learned temporal filters
  • attention or state-space scan

Cini et al., "Graph Deep Learning for Spatiotemporal Time Series", 2023.

03
Learn from a Fixed Slice of the Past

Windowed approaches

Sliding window and MLP.

Fixed past context

Sliding windows

\[ \widehat{X}_{t:t+H}=\mathcal{F}_{\theta}\left(X_{t-W:t}\right) \]
  • AR models
  • Regression trees / forests
  • SVMs
  • MLPs
Animated sliding window over a sinusoidal time series

Bianchi, Time Series Analysis with Python, online handbook, 2024.

Nonlinear windowed approach

MLP: a trainable, nonlinear function approximator

MLP windowed forecasting schematic
  • Task: Learn the map \(X_{t-W:t} \mapsto \widehat{X}_{t:t+H}\)
  • Model: hidden layers learn nonlinear relationships among lags and variables
  • Batching: many window-target pairs are stacked and processed in parallel
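A minimal PyTorch sketch of the windowed MLP described above; the layer widths, window, and horizon are illustrative:

```python
import torch
import torch.nn as nn

W, H, C = 24, 6, 1                               # window, horizon, channels

# Flatten the last W steps and map them nonlinearly to the next H steps.
model = nn.Sequential(
    nn.Flatten(),                                # (batch, W, C) -> (batch, W*C)
    nn.Linear(W * C, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, H * C),
)

windows = torch.randn(32, W, C)                  # a batch of 32 input windows
y_hat = model(windows).reshape(32, H, C)         # predicted horizon for each window
```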
Windowed approaches limitations

What should \( W \) be?

Different sentences require different memory lengths
  • Too short: miss evidence
  • Too long: add noise
  • Memory becomes hand-designed

Can the model learn its own memory?

04
Variable context windows

Recurrent models

RNNs, LSTMs, and Reservoir Computing.

Recurrent Neural Networks

A hidden state carries context

Read \(x_t\): consume the next observation in the sequence
Update \(h_t\): combine the current input with the previous memory \(h_{t-1}\)
Predict \(y_t\): decode the updated state into an output

A single update rule is shared across time. The hidden state \(h_t\) is the compact memory passed to the next step.
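A minimal sketch of this shared update rule with a plain tanh cell; the dimensions and random weights are illustrative:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """Read x_t, combine it with the previous memory h_{t-1}, return h_t."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

d_in, d_h = 3, 8
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(d_h, d_in))
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
b = np.zeros(d_h)

h = np.zeros(d_h)                                # initial memory
for x_t in rng.normal(size=(20, d_in)):          # the same rule is applied at every step
    h = rnn_step(x_t, h, W_x, W_h, b)            # h_t is the compact memory passed forward
```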

Animated recurrent neural network

Bianchi, Time Series Analysis with Python, online handbook, 2024.

Training

Backpropagation through time

Backpropagation through time diagram
  • Unroll in time
  • Share weights
  • Backpropagate through the chain

Vanishing and exploding gradients make the training of RNNs challenging.

Gated RNNs control memory updates and information flow

LSTM with gates

Forget gate animation

Forget gate

Decide which parts of the previous memory survive.

\(f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)\) \(f_t \odot C_{t-1}\)
Input gate animation

Input gate

Choose how much new candidate content to write.

\(i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)\) \(\hat{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C)\) \(i_t \odot \hat{C}_t\)
Cell state update animation

Cell state and output

Combine retained and written memory, then expose useful state.

\(C_t = f_t \odot C_{t-1} + i_t \odot \hat{C}_t\) \(o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)\) \(h_t = o_t \odot \tanh(C_t)\)

Hochreiter and Schmidhuber, "Long short-term memory", 1997.
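A compact NumPy sketch of one LSTM step following the gate equations above; the weight shapes and initialization are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, Wi, WC, Wo, bf, bi, bC, bo):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)              # forget gate: which parts of C_{t-1} survive
    i = sigmoid(Wi @ z + bi)              # input gate: how much new content to write
    C_hat = np.tanh(WC @ z + bC)          # candidate memory content
    C = f * C_prev + i * C_hat            # cell state update
    o = sigmoid(Wo @ z + bo)              # output gate
    h = o * np.tanh(C)                    # exposed hidden state
    return h, C

d_in, d_h = 3, 8
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for _ in range(4)]
bs = [np.zeros(d_h) for _ in range(4)]
h, C = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(20, d_in)):
    h, C = lstm_step(x_t, h, C, *Ws, *bs)
```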

Reservoir computing and Echo State Networks

Randomized RNNs

  • Randomized recurrent (\(W_h\)) and input (\(W_i\)) weights
  • Train only the readout
  • Large reservoir produces rich temporal features
  • The readout selects only the features useful for the task
Animated reservoir computing schematic

Useful when training a standard RNN is too expensive or when data are scarce.

Bianchi et al., "Reservoir computing approaches for representation and classification of multivariate time series", 2020.
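A minimal echo state network sketch under the assumptions above: random input and recurrent weights, a spectral-radius rescale, and only the linear readout fitted. The sizes, spectral radius, and toy task are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_res = 1, 300

W_in = rng.uniform(-0.5, 0.5, size=(d_res, d_in))        # random, untrained input weights
W_h = rng.normal(size=(d_res, d_res))                     # random, untrained recurrent weights
W_h *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_h)))       # rescale to spectral radius ~0.9

def reservoir_states(x):
    """Run the fixed reservoir over a (T, d_in) series and collect its states."""
    h, states = np.zeros(d_res), []
    for x_t in x:
        h = np.tanh(W_in @ x_t + W_h @ h)
        states.append(h)
    return np.asarray(states)

x = np.sin(np.linspace(0, 20, 400)).reshape(-1, 1)        # toy input series
y = np.roll(x, -5)                                        # toy target: 5 steps ahead
S = reservoir_states(x)
W_out, *_ = np.linalg.lstsq(S, y, rcond=None)             # train only the readout
y_hat = S @ W_out
```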

Trade-offs: the lack of training makes hyperparameters critical

Reservoir configuration

Reservoir dynamics under different spectral radii
  • Spectral radius: controls the internal dynamics of the Reservoir and the amount of memory
  • Low values: Contractive regime, short memory, no expressivity
  • High values: Chaotic regime, overfits and cannot generalize
05
Scan for Reusable Temporal Motifs

Temporal Convolutions

1-D filters, receptive fields, and causal convolutions.

1-D convolution

A temporal filter scans for reusable motifs

Image analogy

A small kernel scans space and reuses the same detector at every location.

Schematic: a kernel scans an image patch to produce a feature map.

Temporal filter

The same learned kernel scans time to detect slopes, peaks, pulses, or changes.

Schematic: a shared kernel (K = 3, weights \(w_0, w_1, w_2\)) slides over \(t_1,\dots,t_8\) to produce \(z_t\).

The same learned kernel is applied at every time step.

\[ z_t=\sum_{k=0}^{K-1} w_k\,x_{t-k} \]
  • Weight sharing: same detector everywhere
  • Feature map: response over time
  • Filter bank: many motifs in parallel
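A minimal sketch of the filter response \(z_t=\sum_k w_k x_{t-k}\), with a fixed kernel standing in for a learned one:

```python
import numpy as np

x = np.random.randn(100)                  # toy univariate series
w = np.array([0.25, 0.5, 0.25])           # one kernel (fixed here; learned in a CNN)
K = len(w)

# The same kernel is applied at every valid time step.
z = np.array([sum(w[k] * x[t - k] for k in range(K)) for t in range(K - 1, len(x))])
assert np.allclose(z, np.convolve(x, w, mode="valid"))    # identical to a 1-D convolution
```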
Temporal convolutional networks

Stacked filters build long temporal features

Schematic: three stacked layers over inputs \(x_1,\dots,x_8\); the receptive field grows from 2 to 4 to 8 across layers.

A TCN grows memory by composing small filters across layers.

\[ R = 1+\sum_{\ell=1}^{L}(K_\ell-1)d_\ell \]
  • Layer 1: local slopes and pulses
  • Deeper layers: longer temporal patterns
  • Dilation: wider context without dense kernels

The receptive field is the maximum history that can affect one output.
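A worked instance of the receptive-field formula, with illustrative kernel sizes and dilations:

```python
# R = 1 + sum((K_l - 1) * d_l): maximum history that can affect one output.
kernel_sizes = [3, 3, 3]
dilations = [1, 2, 4]
R = 1 + sum((K - 1) * d for K, d in zip(kernel_sizes, dilations))   # 1 + 2 + 4 + 8 = 15
```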

Keeping the future hidden

Causal convolutions

For forecasting, every output must ignore observations from the future.

  • Left padding: preserve sequence length
  • Right alignment: output belongs to current time
  • No symmetric context: future values are forbidden
Schematic: a causal kernel (K = 3) with left zero-padding over \(t_1,\dots,t_7\); future values are excluded from \(z_t\).

Preprocessing, normalization, and imputation must also respect time.

Van Den Oord et al., "Wavenet: A generative model for raw audio", 2016.
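A PyTorch sketch of a causal 1-D convolution: left padding preserves the sequence length, and no output ever sees the future. Channel counts and sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only sees the present and the past."""
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation                 # pad on the left only
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                       # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))               # the future is never padded in

x = torch.randn(8, 1, 128)                                      # batch of univariate series
y = CausalConv1d(channels=1, kernel_size=3, dilation=2)(x)
assert y.shape == x.shape                                       # same length, causally aligned
```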

Temporal filters detect patterns and changes

Filters example

A temporal CNN learns a bank of filters that match temporal patterns.

  • Increments: The input signal goes up
  • Peak / plateau: crop calendar evidence
  • Drop: harvest, flood, burn, disturbance

Good fit when local temporal shapes are informative and transferable across locations.

Example temporal motifs

Schematic: a filter scanning the signals yields feature-map activations for increments, peaks/plateaus, and drops.
06
Retrieve the Relevant Past

Transformers

Attention, tokens, and validity.

Attention retrieves information from the observed past

Self-attention

Each token asks which other observations are relevant.

\[ \mathrm{Attn}(Q,K,V;M) = \mathrm{softmax} \left( \frac{QK^\top}{\sqrt{d}} + M \right)V \]
  • Query/key: score relevance
  • Value: information to retrieve
  • Mask: hide clouds or future steps
Schematic: attention weights (e.g. 0.20, 0.55, 0.25) over the tokens \(t-5,\dots,t\) are combined into a weighted temporal context for \(t\).

Attention selects observations by content; time encodings tell it when they occurred.

Vaswani et al., "Attention Is All You Need", 2017.
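A NumPy sketch of the masked attention above; the causal mask is just one example of \(M\), and the shapes are illustrative:

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d) + M) V, with M = -inf where attention is forbidden."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + M
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

T, d = 6, 16
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, T, d))

M = np.where(np.tril(np.ones((T, T))) == 1, 0.0, -np.inf)   # hide future steps
context = masked_attention(Q, K, V, M)                      # weighted temporal context per step
```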

Repeated self-attention turns observations into contextual tokens

Temporal Transformer

Observation tokens: each observation becomes a distinct token
Self-attention: repeated blocks let tokens gather information from relevant dates
Contextual tokens: the final token set is pooled to make a unified prediction
Schematic: (1) observation tokens combine the signal with a time encoding, and a mask controls the allowed links; (2) repeated self-attention + MLP blocks (L times) exchange information; (3) the contextual tokens \(z_1,\dots,z_4\) now include temporal context and are pooled to predict \(\hat{y}\).
Temporal sequences

Attention needs time and availability information

A temporal Transformer needs more than the observations themselves: it also needs to know when they happened and whether they should be used.

  • Signal: measurements, indices, or learned features
  • Time: date, lag, phase, or time gap
  • Availability: missing, padded, unreliable, or forbidden tokens

Example temporal sequence

Example: a monthly sequence Jan-Dec with gaps in March and August, low-quality observations in April and September, and key dates in June and July.

The model ignores unavailable dates and can focus attention on the most informative ones.
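A PyTorch sketch of feeding all three ingredients to attention: signal features, a simple sinusoidal time encoding, and an availability mask. The encoding choice, dimensions, and masked months are illustrative:

```python
import torch
import torch.nn as nn

d, T = 32, 12                                    # token dim, one token per month
signal = torch.randn(1, T, d)                    # toy measurement features

day = torch.linspace(0, 365, T)                  # "when": day of year of each observation
time_enc = torch.zeros(1, T, d)
time_enc[0, :, 0] = torch.sin(2 * torch.pi * day / 365)
time_enc[0, :, 1] = torch.cos(2 * torch.pi * day / 365)
tokens = signal + time_enc                       # signal + time encoding

unavailable = torch.zeros(1, T, dtype=torch.bool)
unavailable[0, [2, 7]] = True                    # "whether": e.g. March and August are gaps

attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
context, weights = attn(tokens, tokens, tokens, key_padding_mask=unavailable)
```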

07
Memory as Learned Dynamics

State space models

Continuous-time models, S4, and Mamba.

Memory as continuous dynamics

Learning rates of change

Instead of storing a finite window, model a hidden state that evolves in continuous time.

\(\frac{d s(t)}{dt}=f_{\theta}\!\left(s(t),u(t),t\right), \qquad y(t)=g_{\theta}\!\left(s(t)\right)\)
  • \(s(t)\): compact latent memory
  • \(f_\theta\): learned dynamics driven by input \(u(t)\)
  • \(g_\theta\): readout producing \(y(t)\) at observation times

Continuous evolution, discrete observations

Schematic: the latent state \(s(t)\) evolves continuously between observation times \(t_1,\dots,t_4\); \(f_\theta\) parameterizes \(ds/dt\).

Continuous dynamics evolve across irregular gaps; updates happen at observation times.

Chen et al., "Neural Ordinary Differential Equations", 2018; Rubanova et al., "Latent ODEs for Irregularly-Sampled Time Series", 2019.
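A minimal sketch of evolving a latent state across irregular gaps with simple Euler steps; the linear dynamics stand in for a learned \(f_\theta\), and the times and inputs are illustrative:

```python
import numpy as np

A = np.array([[-0.1, 1.0], [-1.0, -0.1]])        # toy stand-in for learned dynamics f_theta
B = np.array([[0.5], [0.0]])

def evolve(s, u, dt, n_steps=20):
    """Integrate ds/dt = A s + B u across a gap of length dt with Euler steps."""
    h = dt / n_steps
    for _ in range(n_steps):
        s = s + h * (A @ s + B @ u)
    return s

times = [0.0, 0.4, 1.3, 1.5, 3.0]                # irregular observation times
inputs = [np.array([1.0]), np.array([0.0]), np.array([0.5]), np.array([1.0])]

s = np.zeros(2)
for (t0, t1), u in zip(zip(times, times[1:]), inputs):
    s = evolve(s, u, dt=t1 - t0)                 # the state evolves across each gap
```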

Sampling turns dynamics into state updates

State space models

Observe the continuous system only at times \(t_i\).
The elapsed time \(\Delta t_i=t_i-t_{i-1}\) determines how far the state evolves.

Continuous dynamics: \(\dot{s}(t)=A s(t)+B u(t),\quad y(t)=C s(t)+D u(t)\)
Sampled update: \(s_i=\bar{A}(\Delta t_i)s_{i-1}+\bar{B}(\Delta t_i)u_i,\quad y_i=C s_i+D u_i\)
  • \(\Delta t_i\): time since the previous observation
  • \(\bar{A}(\Delta t_i)\): state transition over that gap
  • \(\bar{B}(\Delta t_i)\): input effect accumulated over that gap

State updates at observed times

Schematic: states \(s_0,\dots,s_4\) are updated at observation times \(t_0,\dots,t_4\), separated by gaps \(\Delta t_1,\dots,\Delta t_4\), driven by inputs \(u_1,\dots,u_4\) and producing outputs \(y_1,\dots,y_4\).

After discretization, the sequence is processed by scanning \(s_0 \rightarrow s_1 \rightarrow \cdots\).

\(A,B,C,D\) define state dynamics, input injection, readout, and skip connection; the layer learns these dynamics.

Gu et al., "Combining recurrent, convolutional, and continuous-time models with linear state space layers", 2021.
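A sketch of the sampled update under one common choice, a zero-order-hold discretization with \(\bar{A}=e^{A\Delta t}\) and \(\bar{B}=A^{-1}(e^{A\Delta t}-I)B\); the matrices, times, and inputs are illustrative:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[-0.5, 1.0], [0.0, -0.5]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

def discretize(A, B, dt):
    """Zero-order hold: A_bar = exp(A dt), B_bar = A^-1 (exp(A dt) - I) B."""
    A_bar = expm(A * dt)
    B_bar = np.linalg.solve(A, (A_bar - np.eye(len(A))) @ B)
    return A_bar, B_bar

times = np.array([0.0, 0.3, 1.1, 1.6, 3.0])      # irregular observation times
u = np.random.randn(len(times) - 1, 1)           # one input per observation
s = np.zeros((2, 1))
for dt, u_i in zip(np.diff(times), u):
    A_bar, B_bar = discretize(A, B, dt)          # transition depends on the gap Δt
    s = A_bar @ s + B_bar @ u_i[:, None]         # state update
    y = C @ s + D @ u_i[:, None]                 # readout at the observation time
```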

Modern SSMs make state memory practical

S4 and Mamba

All state-space models carry a compact hidden state. S4 makes this efficient for long sequences; Mamba makes it selective.

Basic SSM: a compact state is updated step by step. Good memory, but sequential: training and inference follow one chain.
S4: the same state-space idea with an efficient long-filter view. Structured dynamics become one long learned filter, so training is as efficient as a convolution.
Mamba: the current input decides what memory should do. Selective memory: important inputs are written, others can be ignored.

S4: Gu, Goel, and Ré, 2021. Mamba: Gu and Dao, 2023.

08
Wrapping things up

Conclusion

Takeaways and references

Takeaways

Which mechanism fits which problem?

Architecture | Memory mechanism | Parallelism | Good fit | Watch out
Window + MLP | Fixed explicit lags | High | Baselines; engineered features | Choosing window size \(W\)
RNN/GRU/LSTM | Learned recurrent state | Low across time | Streaming; latent stages | Training on long-range dependencies is hard
Reservoir/ESN | Random dynamical features | Medium | Limited labels; fast readouts | Sensitive to hyperparameter tuning
TCN | Receptive field by depth/dilation | High | Recurring patterns; local motifs | Context fixed by design
Transformer | Direct attention over observed dates | High | Irregular and missing data | Memory cost
SSM/Mamba | State-space scan with selective memory | High / streaming | Long sequences; efficient context | Newer ecosystem
Sources and credits

References

  1. Bianchi, Time Series Analysis with Python, online handbook, 2024.
  2. Bianchi, Maiorino, Kampffmeyer, Rizzi, and Jenssen, "Recurrent neural networks for short-term load forecasting: an overview and comparative analysis", Springer, 2017.
  3. Bianchi, Scardapane, Løkse, and Jenssen, "Reservoir computing approaches for representation and classification of multivariate time series", IEEE Transactions on Neural Networks and Learning Systems, 32(5), 2169-2179, 2020.
  4. White, Introduction to Spatial Data in R, SAEON GSN workshop, 2022.
  5. Hochreiter and Schmidhuber, "Long short-term memory", Neural Computation, 9(8), 1735-1780, MIT Press, 1997.
  6. Benidis et al., "Deep Learning for Time Series Forecasting", ACM Computing Surveys, 2022.
  7. Jaeger, "The echo state approach to analysing and training recurrent neural networks", 2001.
  8. Bai, Kolter, and Koltun, "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling", 2018.
  9. Van Den Oord et al., "Wavenet: A generative model for raw audio", arXiv preprint arXiv:1609.03499, 2016.
  10. Vaswani et al., "Attention Is All You Need", NeurIPS, 2017.
  11. Chen, Rubanova, Bettencourt, and Duvenaud, "Neural Ordinary Differential Equations", NeurIPS, 2018.
  12. Rubanova, Chen, and Duvenaud, "Latent ODEs for Irregularly-Sampled Time Series", NeurIPS, 2019.
  13. Gu et al., "Combining recurrent, convolutional, and continuous-time models with linear state space layers", NeurIPS, 2021.
  14. Gu et al., "Efficiently Modeling Long Sequences with Structured State Spaces", 2021.
  15. Gu and Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", 2023.
  16. Pelletier, Webb, and Petitjean, "Temporal Convolutional Neural Network for the Classification of Satellite Image Time Series", Remote Sensing, 2019.
  17. Garnot et al., "Satellite Image Time Series Classification with Pixel-Set Encoders and Temporal Self-Attention", CVPR, 2020.
  18. Cini, Marisca, and Zambon, "Graph Deep Learning for Spatiotemporal Time Series", ECML/PKDD tutorial, 2023.
  19. ESA EO Science for Society, "Dynamic Sentinel-2 mosaic service", 2022.
  20. NASA Landsat Science, "Chaco Region Paraguay Time Series", 2025.
  21. NASA Earth Observatory, "Flooding of The Red River", 2015.