Most denormalization debates are framed as performance questions. The better question is consistency: what data is canonical, and what data is just an optimization?

A practical rule:

  • normalize source-of-truth data
  • denormalize derived or cached data

If a value can be recomputed from canonical tables, denormalization can be a good tradeoff. If it represents mutable business truth, duplication usually creates drift.

A quick scaling reminder: Biebergram

One of Instagram’s early scaling stories came from July 2011, when posts from Justin Bieber drove massive traffic spikes. At that scale, you stop recomputing everything with fresh joins on every request and start precomputing hot read paths, especially counters.

That is denormalization in practice: accept slightly more complex writes to make reads fast and predictable.

Reference: TechCrunch - Biebergram (2011)

Denormalization vs aggregates: when to use each

Treat this as three options:

| Approach | Best for | Freshness | Write complexity | Read speed |
| --- | --- | --- | --- | --- |
| Query-time aggregate (COUNT, SUM, GROUP BY) | Simple apps, lower traffic, correctness-first | Always current | Low | Can degrade at scale |
| Denormalized column (posts.like_count) | Hot endpoints, high read QPS | Usually near real-time | Medium/High | Fastest |
| Materialized aggregate (view/summary table) | Dashboards, reporting, analytics | Stale between refreshes | Medium | Fast |

Decision flow

  1. Start with query-time aggregates.
  2. Add indexes and measure.
  3. If hot-path latency is still too high, denormalize that specific field.
  4. If this is reporting/analytics, prefer materialized aggregates over per-row denormalization.

How to implement each approach

1) Query-time aggregate (default starting point)

SELECT post_id, COUNT(*) AS like_count
FROM post_likes
WHERE post_id = $1
GROUP BY post_id;

Simple and always correct, but it can get expensive on very hot endpoints.
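Before denormalizing, confirm the aggregate is actually index-backed (step 2 of the decision flow above). A minimal sketch, assuming PostgreSQL; the index name and the sample id 42 are placeholders of our choosing:

CREATE INDEX post_likes_post_id_idx ON post_likes (post_id);

EXPLAIN ANALYZE
SELECT post_id, COUNT(*) AS like_count
FROM post_likes
WHERE post_id = 42
GROUP BY post_id;

If the plan shows an index scan and latency is acceptable, stop here; many "we need to denormalize" problems are really missing-index problems.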

2) Denormalized counter column (read-optimized)

ALTER TABLE posts ADD COLUMN like_count BIGINT NOT NULL DEFAULT 0;

On write, keep both statements in one transaction so the counter cannot drift from the rows it summarizes:

BEGIN;

INSERT INTO post_likes (post_id, user_id) VALUES ($1, $2);

UPDATE posts
SET like_count = like_count + 1
WHERE id = $1;

COMMIT;

The read path is now a single-row primary-key lookup:

SELECT id, title, like_count
FROM posts
WHERE id = $1;
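The reverse path follows the same pattern. A sketch, assuming an unlike simply deletes the row:

BEGIN;

DELETE FROM post_likes
WHERE post_id = $1 AND user_id = $2;

UPDATE posts
SET like_count = like_count - 1
WHERE id = $1;

COMMIT;

In application code, only run the decrement if the DELETE actually removed a row; otherwise a repeated unlike drives the counter negative.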

3) Materialized aggregate (reporting-optimized)

CREATE MATERIALIZED VIEW post_like_totals AS
SELECT post_id, COUNT(*) AS like_count
FROM post_likes
GROUP BY post_id;

-- PostgreSQL requires a unique index before CONCURRENTLY will work:
CREATE UNIQUE INDEX post_like_totals_post_id_idx ON post_like_totals (post_id);

Refresh on a schedule:

REFRESH MATERIALIZED VIEW CONCURRENTLY post_like_totals;

Great for dashboards where slight staleness is acceptable.

Good denormalization vs bad denormalization

Good denormalization:

  • posts.like_count
  • users.post_count
  • accounts.last_login_at

These are derived values: if they drift, they can be recomputed from canonical tables and repaired.

Bad denormalization:

  • duplicating mutable truth like customer_email across many tables without clear snapshot semantics

If you need a historical copy, make intent explicit:

ALTER TABLE orders ADD COLUMN billing_email_snapshot TEXT NOT NULL;

Now it is intentionally point-in-time, not “live” profile data.
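A sketch of populating the snapshot at write time, assuming a hypothetical customers table with id and email columns:

INSERT INTO orders (customer_id, total_cents, billing_email_snapshot)
SELECT c.id, $2, c.email
FROM customers c
WHERE c.id = $1;

The copy happens exactly once, at order creation; later edits to the customer profile do not (and should not) touch it.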

Safety rules for denormalized data

  1. Keep one canonical source table.
  2. Define acceptable staleness up front.
  3. Update denormalized fields in the same transaction when possible.
  4. Add reconciliation jobs to repair drift.
  5. Monitor mismatch rate so inconsistencies are visible.
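Steps 4 and 5 can be driven by the same comparison against the canonical table. A sketch, using the posts/post_likes schema above:

-- Monitoring: rows where the counter disagrees with canonical data
SELECT p.id,
       p.like_count,
       (SELECT COUNT(*) FROM post_likes pl WHERE pl.post_id = p.id) AS actual
FROM posts p
WHERE p.like_count <> (SELECT COUNT(*) FROM post_likes pl WHERE pl.post_id = p.id);

-- Reconciliation: repair only the drifted rows
UPDATE posts
SET like_count = (SELECT COUNT(*) FROM post_likes pl WHERE pl.post_id = posts.id)
WHERE like_count <> (SELECT COUNT(*) FROM post_likes pl WHERE pl.post_id = posts.id);

Track the count of mismatched rows over time; a rising mismatch rate means a write path is skipping the counter update.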

Denormalization is not a schema mistake. It is a performance tool. Use it after measuring, apply it narrowly, and make correctness guarantees explicit.