Most denormalization debates are framed as performance questions. The better question is consistency: what data is canonical, and what data is just an optimization?
A practical rule:
- normalize source-of-truth data
- denormalize derived or cached data
If a value can be recomputed from canonical tables, denormalization can be a good tradeoff. If it represents mutable business truth, duplication usually creates drift.
A quick scaling reminder: Biebergram
One of Instagram’s early scaling stories is the July 2011 traffic spike, when a Justin Bieber post drove massive load. At that scale, you stop recomputing everything with fresh joins on every request and precompute hot read paths, especially counters.
That is denormalization in practice: accept slightly more complex writes to make reads fast and predictable.
Reference: TechCrunch - Biebergram (2011)
Denormalization vs aggregates: when to use each
Treat this as three options:
| Approach | Best for | Freshness | Write complexity | Read speed |
|---|---|---|---|---|
| Query-time aggregate (`COUNT`, `SUM`, `GROUP BY`) | Simple apps, lower traffic, correctness-first | Always current | Low | Can degrade at scale |
| Denormalized column (`posts.like_count`) | Hot endpoints, high read QPS | Usually near real-time | Medium/high | Fastest |
| Materialized aggregate (view/summary table) | Dashboards, reporting, analytics | Stale between refreshes | Medium | Fast |
Decision flow
- Start with query-time aggregates.
- Add indexes and measure.
- If hot-path latency is still too high, denormalize that specific field.
- If this is reporting/analytics, prefer materialized aggregates over per-row denormalization.
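The first two steps of the flow above can be sketched in PostgreSQL syntax. The index name and the literal `post_id = 42` are illustrative assumptions, not part of the article's schema:

```sql
-- Index the column the aggregate filters on before reaching for denormalization.
CREATE INDEX IF NOT EXISTS idx_post_likes_post_id ON post_likes (post_id);

-- Measure: compare plan and timing before and after the index.
EXPLAIN ANALYZE
SELECT COUNT(*) FROM post_likes WHERE post_id = 42;
```

Only move to a denormalized field if the measured hot-path latency is still unacceptable after indexing.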
How to implement each approach
1) Query-time aggregate (default starting point)
```sql
SELECT post_id, COUNT(*) AS like_count
FROM post_likes
WHERE post_id = $1
GROUP BY post_id;
```
Simple and always correct, but it can get expensive on very hot endpoints.
2) Denormalized counter column (read-optimized)
```sql
ALTER TABLE posts ADD COLUMN like_count BIGINT NOT NULL DEFAULT 0;
```
On write (same transaction):

```sql
BEGIN;
INSERT INTO post_likes (post_id, user_id) VALUES ($1, $2);
UPDATE posts
SET like_count = like_count + 1
WHERE id = $1;
COMMIT;
```
Read path is now O(1):
```sql
SELECT id, title, like_count
FROM posts
WHERE id = $1;
```
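The unlike path is the mirror image. A sketch under the same schema, assuming a `(post_id, user_id)` unique constraint prevents double-likes:

```sql
BEGIN;
DELETE FROM post_likes WHERE post_id = $1 AND user_id = $2;
-- Only decrement if the DELETE actually removed a row;
-- check the affected-row count in application code before running this.
UPDATE posts SET like_count = like_count - 1 WHERE id = $1;
COMMIT;
```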
3) Materialized aggregate (reporting-optimized)
```sql
CREATE MATERIALIZED VIEW post_like_totals AS
SELECT post_id, COUNT(*) AS like_count
FROM post_likes
GROUP BY post_id;

-- REFRESH ... CONCURRENTLY requires a unique index on the view:
CREATE UNIQUE INDEX ON post_like_totals (post_id);
```

Refresh on a schedule:

```sql
REFRESH MATERIALIZED VIEW CONCURRENTLY post_like_totals;
```
Great for dashboards where slight staleness is acceptable.
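One way to run that refresh on a schedule is the pg_cron extension; this is an assumption, and any external job scheduler works just as well:

```sql
-- Assumes pg_cron is installed and enabled in the database.
SELECT cron.schedule(
  'refresh-post-like-totals',
  '*/5 * * * *',  -- every five minutes
  $$REFRESH MATERIALIZED VIEW CONCURRENTLY post_like_totals$$
);
```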
Good denormalization vs bad denormalization
Good denormalization:
- `posts.like_count`
- `users.post_count`
- `accounts.last_login_at`
These are derived values and can be repaired.
Bad denormalization:
- duplicating mutable truth like `customer_email` across many tables without clear snapshot semantics
If you need a historical copy, make intent explicit:
```sql
ALTER TABLE orders ADD COLUMN billing_email_snapshot TEXT;
-- Backfill existing rows first, then enforce the constraint:
ALTER TABLE orders ALTER COLUMN billing_email_snapshot SET NOT NULL;
```
Now it is intentionally point-in-time, not “live” profile data.
Safety rules for denormalized data
- Keep one canonical source table.
- Define acceptable staleness up front.
- Update denormalized fields in the same transaction when possible.
- Add reconciliation jobs to repair drift.
- Monitor mismatch rate so inconsistencies are visible.
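The reconciliation and monitoring rules above can be sketched as two queries against the `posts.like_count` counter from earlier:

```sql
-- Reconciliation job: overwrite any drifted counter from the canonical table.
UPDATE posts p
SET like_count = (SELECT COUNT(*) FROM post_likes l WHERE l.post_id = p.id)
WHERE p.like_count IS DISTINCT FROM
      (SELECT COUNT(*) FROM post_likes l WHERE l.post_id = p.id);

-- Monitoring: count rows whose counter disagrees with the source of truth.
SELECT COUNT(*) AS drifted_rows
FROM posts p
WHERE p.like_count IS DISTINCT FROM
      (SELECT COUNT(*) FROM post_likes l WHERE l.post_id = p.id);
```

Alert when `drifted_rows` exceeds your agreed staleness budget; the repair query is then safe to run at any time because the canonical table always wins.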
Denormalization is not a schema mistake. It is a performance tool. Use it after measuring, apply it narrowly, and make correctness guarantees explicit.