The biggest surprise about PyTorch LSTMs for time-series forecasting is that, without careful tuning and data preprocessing, they often underperform simpler statistical models like ARIMA, especially on short or moderately complex sequences.
Let’s look at a basic LSTM setup for predicting the next value in a univariate time series. Imagine we have a sequence of stock prices: [100.5, 101.2, 100.9, 102.5, 103.1, 102.8]. We want to train an LSTM to predict the next value after 102.8.
import torch
import torch.nn as nn
import numpy as np
# Sample data (replace with your actual time series)
data = np.array([100.5, 101.2, 100.9, 102.5, 103.1, 102.8], dtype=np.float32)
# --- Data Preprocessing ---
# 1. Scaling: Crucial for LSTMs. Min-Max scaling is common.
min_val = data.min()
max_val = data.max()
scaled_data = (data - min_val) / (max_val - min_val)
# 2. Create sequences: Input (X) and target (y) pairs.
# We'll use a lookback window. Let's say window_size = 3.
window_size = 3
X_list, y_list = [], []
for i in range(len(scaled_data) - window_size):
    X_list.append(scaled_data[i:(i + window_size)])
    y_list.append(scaled_data[i + window_size])
X = np.array(X_list)
y = np.array(y_list)
# 3. Reshape for LSTM: [samples, timesteps, features]
# For univariate data, features = 1.
X = X.reshape((X.shape[0], X.shape[1], 1))
# Convert to PyTorch tensors
X_tensor = torch.tensor(X)
y_tensor = torch.tensor(y).unsqueeze(1) # Add a feature dimension for regression
# --- LSTM Model ---
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Initialize hidden and cell states on the same device and dtype as the input
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size,
                         device=x.device, dtype=x.dtype)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size,
                         device=x.device, dtype=x.dtype)
        # Forward propagate LSTM
        out, _ = self.lstm(x, (h0, c0))
        # Decode the hidden state of the last time step
        out = self.fc(out[:, -1, :])
        return out
# Model parameters
input_size = 1 # Number of features in each step (univariate)
hidden_size = 50 # Number of features in the hidden state
num_layers = 2 # Number of stacked LSTM layers
output_size = 1 # Number of output values (next time step)
model = LSTMModel(input_size, hidden_size, num_layers, output_size)
# --- Training ---
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
num_epochs = 1000
for epoch in range(num_epochs):
    outputs = model(X_tensor)
    loss = criterion(outputs, y_tensor)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch+1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
# --- Prediction ---
# Prepare the last sequence from the original data for prediction
last_sequence = scaled_data[-window_size:]
last_sequence = last_sequence.reshape((1, window_size, 1)) # [batch_size, timesteps, features]
last_sequence_tensor = torch.tensor(last_sequence)
model.eval() # Set model to evaluation mode
with torch.no_grad():
    predicted_scaled = model(last_sequence_tensor)
predicted_value = predicted_scaled.item()
# Inverse transform the prediction
predicted_original_scale = predicted_value * (max_val - min_val) + min_val
print(f"\nPredicted next value: {predicted_original_scale:.2f}")
This code sets up a basic LSTM. We first scale the data, then create overlapping input sequences and their corresponding target values. With six data points and a window of three, that yields only three training samples, so treat the result as a demonstration of the mechanics rather than a meaningful forecast. The LSTM model itself consists of an nn.LSTM layer followed by an nn.Linear layer that outputs a single prediction. Training minimizes the Mean Squared Error (MSE) between predicted and actual values. Finally, we use the last window_size data points to predict the next value and inverse-transform it back to the original scale.
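To make the windowing step concrete, here is a minimal sketch that prints the input/target pairs for the sample series. The helper name make_windows is hypothetical (it is not part of the listing above), and the values are left unscaled purely for readability.

```python
import numpy as np

def make_windows(series, window_size):
    """Hypothetical helper: X[i] holds window_size consecutive values,
    y[i] is the value that immediately follows that window."""
    n_samples = len(series) - window_size
    X = np.stack([series[i:i + window_size] for i in range(n_samples)])
    y = series[window_size:]
    return X, y

data = np.array([100.5, 101.2, 100.9, 102.5, 103.1, 102.8], dtype=np.float32)
X, y = make_windows(data, window_size=3)

print(X.shape, y.shape)  # (3, 3) (3,)
for window, target in zip(X, y):
    print(window, "->", target)
```

Each row of X overlaps the previous one by window_size - 1 values, which is why a short series produces so few independent training samples.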
The problem LSTMs solve is capturing sequential dependencies. Unlike a simple feed-forward network that treats each input independently, an LSTM has internal "memory" (the hidden and cell states) that allows it to retain information from previous time steps. This is crucial for time series where the future often depends on past patterns, trends, and seasonality. The "gates" within an LSTM cell (input, forget, output) dynamically control what information is stored, updated, and outputted, enabling it to learn long-range dependencies that simpler recurrent neural networks (RNNs) struggle with.
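The gate mechanics can be checked by hand. The sketch below recomputes a single step of nn.LSTMCell from its stored weights and compares the result with the built-in forward pass. The sizes and random inputs are arbitrary choices for illustration; the gate order (input, forget, candidate, output) is the one PyTorch uses when it stacks the weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, batch = 1, 4, 2  # arbitrary illustrative sizes
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(batch, input_size)
h_prev = torch.randn(batch, hidden_size)
c_prev = torch.randn(batch, hidden_size)

with torch.no_grad():
    # All four gate pre-activations come from one stacked matrix product.
    gates = (x_t @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    # Stacking order in PyTorch: input, forget, candidate (g), output.
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)

    c_t = f * c_prev + i * g      # forget part of the old memory, add new content
    h_t = o * torch.tanh(c_t)     # output gate decides what the cell exposes

    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

print(torch.allclose(h_t, h_ref, atol=1e-6), torch.allclose(c_t, c_ref, atol=1e-6))
```

The cell state update is additive (f * c_prev + i * g), which is what lets gradients flow across many time steps more easily than in a plain RNN.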
The core idea is to unroll the network through time. At each time step t, the LSTM receives an input x_t along with the hidden state h_{t-1} and cell state c_{t-1} from the previous step. It then computes a new hidden state h_t and cell state c_t, which are passed to the next time step. In nn.LSTM, the per-step output is simply the hidden state h_t of the top layer, so there is no separate y_t. For forecasting, we typically only care about the output at the last time step of an input sequence, which the linear layer then maps to a prediction of the next value in the series.
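A quick shape check makes the unrolling easier to see. This sketch assumes the same layer configuration as the model above (input_size=1, hidden_size=50, num_layers=2) and feeds it a random batch. It confirms that the output at the last time step equals the final hidden state of the top layer, and that omitting the initial states is equivalent to passing zeros.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=1, hidden_size=50, num_layers=2, batch_first=True)
x = torch.randn(3, 3, 1)  # [batch, timesteps, features], random placeholder input

with torch.no_grad():
    out, (h_n, c_n) = lstm(x)

    # Passing explicit zero states gives the same result as the default.
    zeros = torch.zeros(2, 3, 50)
    out_explicit, _ = lstm(x, (zeros, zeros))

print(out.shape)   # torch.Size([3, 3, 50]): top-layer hidden state at every step
print(h_n.shape)   # torch.Size([2, 3, 50]): final hidden state of every layer
print(torch.allclose(out[:, -1, :], h_n[-1]))
print(torch.allclose(out, out_explicit))
```

Because the initial states default to zeros, the explicit h0 and c0 tensors in the model's forward method are optional; they mainly serve to make the mechanics visible.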
The batch_first=True argument in nn.LSTM means that the input and output tensors are expected in the format [batch_size, sequence_length, input_size]. If batch_first=False (the default), it’s [sequence_length, batch_size, input_size]. For our use case with multiple training samples, batch_first=True often makes indexing and handling easier, especially when accessing the output of the last time step (out[:, -1, :]).
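The two layouts can be compared directly. In this sketch, two LSTMs share the same weights and differ only in batch_first; the sizes are arbitrary. Note that the flag affects only the input and output tensors: the hidden and cell states keep the shape [num_layers, batch_size, hidden_size] either way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm_bf = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)
lstm_tf = nn.LSTM(input_size=1, hidden_size=8)      # batch_first=False (default)
lstm_tf.load_state_dict(lstm_bf.state_dict())       # identical weights

x_bf = torch.randn(4, 3, 1)                         # [batch, seq, features]
x_tf = x_bf.transpose(0, 1).contiguous()            # [seq, batch, features]

with torch.no_grad():
    out_bf, (h_bf, _) = lstm_bf(x_bf)
    out_tf, (h_tf, _) = lstm_tf(x_tf)

print(out_bf.shape, out_tf.shape)  # torch.Size([4, 3, 8]) torch.Size([3, 4, 8])
print(h_bf.shape, h_tf.shape)      # both torch.Size([1, 4, 8])

# Last time step: out[:, -1, :] with batch_first=True, out[-1] otherwise.
print(torch.allclose(out_bf[:, -1, :], out_tf[-1], atol=1e-5))
```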
An often-overlooked aspect of using LSTMs for time series is their sensitivity to the hidden_size and num_layers hyperparameters. It may seem that more layers and larger hidden states should always capture more complex patterns, but excessively large values can lead to overfitting, especially with limited training data, and they increase computational cost. As a rough starting point, a hidden_size between 20 and 100 and 1 to 3 layers is sufficient for many time-series problems. In practice, how you structure your data (window size, scaling) and the learning rate you choose often matter more than model size.
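One way to get a feel for the overfitting risk is to count trainable parameters. The sketch below uses a hypothetical helper, lstm_param_count, that mirrors how nn.LSTM lays out its weights (four gates, two bias vectors per layer), and checks it against PyTorch's own count for a few configurations picked for illustration.

```python
import torch.nn as nn

def lstm_param_count(input_size, hidden_size, num_layers):
    """Hypothetical helper: trainable parameters of a unidirectional nn.LSTM."""
    total = 0
    for layer in range(num_layers):
        in_features = input_size if layer == 0 else hidden_size
        total += 4 * hidden_size * (in_features + hidden_size)  # weight_ih + weight_hh
        total += 2 * 4 * hidden_size                            # bias_ih + bias_hh
    return total

for hidden_size, num_layers in [(20, 1), (50, 2), (100, 3)]:
    lstm = nn.LSTM(1, hidden_size, num_layers, batch_first=True)
    actual = sum(p.numel() for p in lstm.parameters())
    expected = lstm_param_count(1, hidden_size, num_layers)
    print(f"hidden_size={hidden_size}, num_layers={num_layers}: {actual} parameters")
    assert actual == expected
```

The configuration used earlier (hidden_size=50, num_layers=2) has 31,000 LSTM parameters, fitted to just three training windows in the toy example, so a near-zero training loss there says little about forecasting ability.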
Once you’ve mastered basic univariate forecasting, the next challenge is multivariate time series, where you predict future values from multiple correlated input series. This requires careful feature engineering and setting the LSTM's input_size to the number of features supplied at each time step.
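As a rough sketch of what changes in the multivariate case, the example below windows a series with three features and feeds it to an LSTM whose input_size matches the feature count. The data is synthetic random noise used only as a placeholder, and the choice of column 0 as the forecasting target is an assumption for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
n_steps, n_features, window_size = 20, 3, 5
# Synthetic placeholder data: rows are time steps, columns are features.
series = rng.normal(size=(n_steps, n_features)).astype(np.float32)
target_col = 0  # assumed target: predict the next value of the first feature

# Each window keeps all features; the target is one column of the following row.
X = np.stack([series[i:i + window_size] for i in range(n_steps - window_size)])
y = series[window_size:, target_col]

X_tensor = torch.tensor(X)                  # [samples, timesteps, features]
y_tensor = torch.tensor(y).unsqueeze(1)     # [samples, 1]

lstm = nn.LSTM(input_size=n_features, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)

with torch.no_grad():
    out, _ = lstm(X_tensor)
    pred = head(out[:, -1, :])

print(X_tensor.shape, y_tensor.shape, pred.shape)
```

No reshape to add a feature dimension is needed here, because the windows already carry one; in the univariate listing that dimension had to be added explicitly with a size of 1.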