One surprising thing about training PyTorch embeddings is how little any single training step changes them.

Let’s see it in action. Imagine we’re training a tiny model to classify the sentiment of movie reviews. Our vocabulary is small: "good," "bad," "movie," "is." We’ll initialize an embedding layer with random weights.

import torch
import torch.nn as nn

vocab_size = 4
embedding_dim = 8
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Our simple vocabulary mapping
word_to_id = {"good": 0, "bad": 1, "movie": 2, "is": 3}

# Example sentence: "movie is good" (every token must be in the vocabulary)
sentence = ["movie", "is", "good"]
indices = [word_to_id[word] for word in sentence]
input_tensor = torch.tensor(indices)

# Forward pass: get the embeddings
embeddings = embedding_layer(input_tensor)
print("Initial embeddings shape:", embeddings.shape)
print("Initial embeddings:\n", embeddings)

Now, let’s pretend we’re doing one training step. We have a dummy loss and we’ll backpropagate.

# Dummy loss calculation (e.g., mean of embeddings)
loss = embeddings.mean()
loss.backward()

# In a real scenario, an optimizer would update the weights.
# For demonstration, let's see the gradients.
print("\nGradients after one backward pass:\n", embedding_layer.weight.grad)

Notice which rows receive gradients. Only the rows that were looked up in the forward pass ("good," "movie," "is") get nonzero values; the row for "bad" stays at zero because it never appeared in the input. Each gradient points in the direction that would increase the loss, so an optimizer like Adam moves the corresponding embedding vector a small step the opposite way. In a large model trained on a massive dataset, each word’s embedding is influenced by millions of sentences, and the update for any single word in a single batch is minuscule. The embedding vectors are learned from co-occurrence patterns and contextual usage across vast amounts of text. They are not assigned by explicit rules; they emerge from statistical regularities.
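To make the update step concrete, here is a minimal sketch that takes one optimizer step on the same toy setup. It uses plain SGD rather than Adam so the arithmetic stays easy to follow, and the learning rate of 0.1 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

embedding_layer = nn.Embedding(4, 8)  # same sizes as the example above
optimizer = torch.optim.SGD(embedding_layer.parameters(), lr=0.1)

# Keep a copy of the weights so we can compare after the step
before = embedding_layer.weight.detach().clone()

# Indices for "movie", "is", "good"; index 1 ("bad") is not in the batch
input_tensor = torch.tensor([2, 3, 0])
loss = embedding_layer(input_tensor).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()

after = embedding_layer.weight.detach()
row_changed = (after != before).any(dim=1)
max_shift = (after - before).abs().max().item()

print("Rows changed by the update:", row_changed.tolist())
print("Largest change in any component:", max_shift)
```

The row for "bad" should come out unchanged, because its gradient is zero, while the other three rows move by only a tiny amount.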

The core problem embeddings solve is representing discrete words as continuous, dense vectors, which lets a neural network do arithmetic on word meanings. One-hot encoding gives each word a huge, sparse vector that is equally distant from every other word, so it says nothing about how words relate. An embedding layer instead learns a fixed-size vector for each word, and words used in similar contexts end up with similar vectors.
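The contrast with one-hot encoding can be sketched in a few lines. The snippet below builds one-hot vectors for the toy sentence and shows that multiplying them by the embedding weight matrix gives the same result as the direct lookup; the sizes are the ones from the example above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 4, 8
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

indices = torch.tensor([2, 3, 0])  # "movie", "is", "good"

# One-hot: one column per vocabulary word, a single 1 per row
one_hot = F.one_hot(indices, num_classes=vocab_size).float()  # shape (3, 4)

# Multiplying a one-hot row by the weight matrix selects one row of it
via_matmul = one_hot @ embedding_layer.weight  # (3, 4) @ (4, 8) -> (3, 8)
via_lookup = embedding_layer(indices)          # direct lookup, shape (3, 8)

print("One-hot vectors:\n", one_hot)
print("Same result:", torch.allclose(via_matmul, via_lookup))
```

With a real vocabulary of tens of thousands of words, the one-hot matrix would be almost entirely zeros, which is why the lookup is the practical choice.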

Internally, nn.Embedding is essentially a lookup table. When you pass a tensor of word indices, it retrieves the corresponding rows from its weight matrix. That weight matrix is the layer's only parameter, and its shape is (vocab_size, embedding_dim). Each row is a vector of size embedding_dim representing one word.
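A quick way to check the lookup-table view is to index the weight matrix by hand and compare. This sketch also passes a batch of two sentences to show how the output shape follows the input shape; the second sentence is made up for illustration.

```python
import torch
import torch.nn as nn

embedding_layer = nn.Embedding(4, 8)  # vocab_size=4, embedding_dim=8
weight_shape = tuple(embedding_layer.weight.shape)
print("Weight matrix shape:", weight_shape)

# A single sentence: the layer returns the rows at these indices
indices = torch.tensor([2, 3, 0])
looked_up = embedding_layer(indices)
indexed = embedding_layer.weight[indices]  # plain tensor indexing
same = torch.equal(looked_up, indexed)
print("Lookup equals row indexing:", same)

# A batch of two sentences: output shape is input shape plus embedding_dim
batch = torch.tensor([[2, 3, 0], [2, 3, 1]])
batch_shape = tuple(embedding_layer(batch).shape)
print("Batch output shape:", batch_shape)
```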

The embedding_dim is a hyperparameter you tune. Too small, and the embeddings might not capture enough nuance. Too large, and you risk overfitting and increased computational cost. Typical values range from 50 to 300 for word embeddings, but can go much higher for more complex tasks or larger vocabularies.
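The cost side of that trade-off is easy to quantify, since the layer holds vocab_size * embedding_dim parameters. The sketch below counts them for a few dimensions; the vocabulary size of 10,000 is a hypothetical figure chosen only for illustration.

```python
import torch.nn as nn

vocab_size = 10_000  # hypothetical vocabulary size

param_counts = {}
for embedding_dim in (50, 100, 300):
    layer = nn.Embedding(vocab_size, embedding_dim)
    # The weight matrix is the layer's only parameter
    param_counts[embedding_dim] = layer.weight.numel()
    print(f"embedding_dim={embedding_dim}: {param_counts[embedding_dim]:,} parameters")
```

The count grows linearly with embedding_dim, so doubling the dimension doubles both the memory for the table and the number of values the model has to fit.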

The process of training these embeddings involves presenting the model with text data, allowing it to predict something (like the next word, or a label), calculating a loss based on that prediction, and then using backpropagation to adjust the embedding vectors (along with other model parameters) to minimize that loss. Over many such steps, the embedding vectors for words that appear in similar contexts will converge to similar points in the embedding space.
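The loop described above can be sketched end to end on the toy vocabulary. Everything here is an assumption made for illustration: the four-example dataset, the TinySentimentModel class (which simply averages word vectors and applies one linear layer), and the choice of Adam with a learning rate of 0.05.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

word_to_id = {"good": 0, "bad": 1, "movie": 2, "is": 3}

# Hypothetical toy dataset: (tokens, label), 1 = positive, 0 = negative
data = [
    (["movie", "is", "good"], 1),
    (["movie", "is", "bad"], 0),
    (["good", "movie"], 1),
    (["bad", "movie"], 0),
]

class TinySentimentModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.classifier = nn.Linear(embedding_dim, 1)

    def forward(self, token_ids):
        # Average the word vectors of one sentence, then score it
        sentence_vector = self.embedding(token_ids).mean(dim=0)
        return self.classifier(sentence_vector)  # shape (1,)

model = TinySentimentModel(len(word_to_id), embedding_dim=8)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.BCEWithLogitsLoss()

losses = []
for epoch in range(50):
    total = 0.0
    for tokens, label in data:
        token_ids = torch.tensor([word_to_id[t] for t in tokens])
        target = torch.tensor([float(label)])
        loss = loss_fn(model(token_ids), target)  # predict, then score
        optimizer.zero_grad()
        loss.backward()   # gradients flow into the embedding rows used
        optimizer.step()  # embeddings and classifier update together
        total += loss.item()
    losses.append(total / len(data))

print(f"first epoch loss: {losses[0]:.4f}, last epoch loss: {losses[-1]:.4f}")
```

The embedding rows are ordinary parameters here: the optimizer updates them alongside the classifier weights, and nothing in the loop treats them specially.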

What most people don’t realize is that the "meaning" captured by an embedding is entirely derived from its relationships with other embeddings in the learned space. The absolute values of the vector components are arbitrary; it’s the relative distances and directions between vectors that encode semantic and syntactic information. For instance, in well-trained embeddings, the vector difference between "king" and "man" might be very similar to the vector difference between "queen" and "woman."
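Here is a small sketch of that idea using hand-made 2-D vectors rather than learned ones, so the numbers are assumptions chosen to make the analogy exact. It compares the two offsets with cosine similarity, then rotates every vector by the same angle to show that the components change while the relationship does not.

```python
import math
import torch
import torch.nn.functional as F

# Hand-made toy vectors (not learned): axis 0 ~ "royalty", axis 1 ~ "gender"
vectors = {
    "king": torch.tensor([1.0, 1.0]),
    "man": torch.tensor([0.0, 1.0]),
    "queen": torch.tensor([1.0, -1.0]),
    "woman": torch.tensor([0.0, -1.0]),
}

def offset_similarity(vecs):
    """Cosine similarity between (king - man) and (queen - woman)."""
    a = vecs["king"] - vecs["man"]
    b = vecs["queen"] - vecs["woman"]
    return F.cosine_similarity(a, b, dim=0).item()

# Rotate every vector by the same angle: components change, geometry does not
theta = math.radians(40)
rotation = torch.tensor([
    [math.cos(theta), -math.sin(theta)],
    [math.sin(theta), math.cos(theta)],
])
rotated = {word: rotation @ vec for word, vec in vectors.items()}

sim_original = offset_similarity(vectors)
sim_rotated = offset_similarity(rotated)

print("king before rotation:", vectors["king"].tolist())
print("king after rotation: ", rotated["king"].tolist())
print("offset similarity before:", sim_original)
print("offset similarity after: ", sim_rotated)
```

Real learned embeddings only satisfy such analogies approximately, so in practice the similarity would be high rather than exactly one.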

The next hurdle is understanding how to fine-tune these pre-trained embeddings for specific downstream tasks.
