Data Normalization: Beyond Cleaning

Database normalization is often taught as a way to reduce redundancy, but its real power lies in making your data changeable.

Let’s see it in action. Imagine we have a simple table tracking book orders:

-- Unnormalized Table
CREATE TABLE BookOrders (
    OrderID INT PRIMARY KEY,
    OrderDate DATE,
    CustomerID INT,
    CustomerName VARCHAR(100),
    CustomerAddress VARCHAR(255),
    BookID INT,
    BookTitle VARCHAR(200),
    BookAuthor VARCHAR(100),
    Quantity INT,
    Price DECIMAL(10, 2)
);

INSERT INTO BookOrders (OrderID, OrderDate, CustomerID, CustomerName, CustomerAddress, BookID, BookTitle, BookAuthor, Quantity, Price) VALUES
(101, '2023-10-26', 501, 'Alice Wonderland', '123 Main St', 10, 'The Great Gatsby', 'F. Scott Fitzgerald', 1, 15.99),
(101, '2023-10-26', 501, 'Alice Wonderland', '123 Main St', 11, '1984', 'George Orwell', 2, 12.50),
(102, '2023-10-26', 502, 'Bob The Builder', '456 Oak Ave', 10, 'The Great Gatsby', 'F. Scott Fitzgerald', 1, 15.99);

Now, what if Alice moves? You have to update her address in every single row associated with her order. If you miss one, her address is inconsistent. What if a book title changes slightly? Same problem. This is where normalization shines.

Normalization breaks down a large table into smaller, related tables, each holding a single type of information. This ensures that a piece of data, like a customer’s address or a book’s title, is stored in only one place.

First Normal Form (1NF): Atomic Values

The first rule is that every column must contain atomic values, and each record must be unique. "Atomic" means indivisible – you can’t have a list or a set of values in a single cell.

Our BookOrders table is in 1NF because each cell contains a single value. OrderID is the primary key, ensuring each row is unique.

Second Normal Form (2NF): No Partial Dependencies

To be in 2NF, a table must first be in 1NF. Then, all non-key attributes must be fully functionally dependent on the entire primary key. This rule only matters for tables with composite primary keys (keys made of multiple columns).

Our BookOrders table has a simple primary key (OrderID), so it’s automatically in 2NF. If OrderID were a composite key (e.g., (OrderID, BookID)), then BookTitle and BookAuthor would be a problem if they only depended on BookID, not the whole composite key.

Third Normal Form (3NF): No Transitive Dependencies

For a table to be in 3NF, it must be in 2NF. The critical rule here is that there should be no transitive dependencies. A transitive dependency exists when a non-key attribute depends on another non-key attribute, which in turn depends on the primary key.

Let’s normalize our BookOrders table. We can see that CustomerName and CustomerAddress depend on CustomerID, which is not the primary key of the BookOrders table. Similarly, BookTitle and BookAuthor depend on BookID.

We’ll create three new tables:

Customers: To store customer information.
Books: To store book information.
Orders: To store order details, linking customers and books.

-- Customers Table (1NF, 2NF, 3NF)
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    CustomerAddress VARCHAR(255)
);

-- Books Table (1NF, 2NF, 3NF)
CREATE TABLE Books (
    BookID INT PRIMARY KEY,
    BookTitle VARCHAR(200),
    BookAuthor VARCHAR(100)
);

-- Orders Table (1NF, 2NF, 3NF)
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    OrderDate DATE,
    CustomerID INT,
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

-- OrderItems Table (1NF, 2NF, 3NF) - Handles the many-to-many between Orders and Books
CREATE TABLE OrderItems (
    OrderItemID INT PRIMARY KEY AUTO_INCREMENT, -- Or use a composite key (OrderID, BookID)
    OrderID INT,
    BookID INT,
    Quantity INT,
    Price DECIMAL(10, 2),
    FOREIGN KEY (OrderID) REFERENCES Orders(OrderID),
    FOREIGN KEY (BookID) REFERENCES Books(BookID)
);

-- Inserting data into normalized tables
INSERT INTO Customers (CustomerID, CustomerName, CustomerAddress) VALUES
(501, 'Alice Wonderland', '123 Main St'),
(502, 'Bob The Builder', '456 Oak Ave');

INSERT INTO Books (BookID, BookTitle, BookAuthor) VALUES
(10, 'The Great Gatsby', 'F. Scott Fitzgerald'),
(11, '1984', 'George Orwell');

INSERT INTO Orders (OrderID, OrderDate, CustomerID) VALUES
(101, '2023-10-26', 501),
(102, '2023-10-26', 502);

INSERT INTO OrderItems (OrderID, BookID, Quantity, Price) VALUES
(101, 10, 1, 15.99),
(101, 11, 2, 12.50),
(102, 10, 1, 15.99);

Now, if Alice moves, you only update her CustomerAddress in one row in the Customers table. If a book title changes, you update it in one row in the Books table. This is the core benefit: data integrity and ease of modification.

The trickiest part of normalization is often identifying what constitutes a "non-key attribute" and understanding what it depends on. In 3NF, we’re specifically looking for a situation where Attribute A depends on Attribute B, and Attribute B depends on the primary key. For example, if we had CustomerID -> CustomerAddress and CustomerID -> CustomerCity, and CustomerCity was in a separate table linked by CustomerID, that’s good. But if CustomerCity was stored directly in the Customers table and CustomerAddress also contained the city, and CustomerAddress was not part of the primary key, that would be a transitive dependency.

The next step is understanding how to join these tables efficiently for querying, which leads into denormalization for performance in specific use cases.