Most denormalization debates are framed as performance questions. The better question is consistency: what data is canonical, and what data is just an optimization?
A practical rule:
- normalize source-of-truth data
- denormalize derived or cached data
If a value can be recomputed from canonical tables, denormalization can be a good tradeoff. If it represents mutable business truth, duplication usually creates drift.
A quick scaling reminder: Biebergram
One of Instagram’s early scaling stories is the July 2011 traffic spike, when a Justin Bieber post drove massive load. At that scale, you stop recomputing everything with fresh joins on every request and precompute hot read paths, especially counters.
That is denormalization in practice: accept slightly more complex writes to make reads fast and predictable.
Reference: TechCrunch - Biebergram (2011)
Denormalization vs aggregates: when to use each
Treat this as three options:
| Approach | Best for | Freshness | Write complexity | Read speed |
|---|---|---|---|---|
| Query-time aggregate (`COUNT`, `SUM`, `GROUP BY`) | Simple apps, lower traffic, correctness-first | Always current | Low | Can degrade at scale |
| Denormalized column (`posts.like_count`) | Hot endpoints, high read QPS | Usually near real-time | Medium/high | Fastest |
| Materialized aggregate (view/summary table) | Dashboards, reporting, analytics | Stale between refreshes | Medium | Fast |
Decision flow
- Start with query-time aggregates.
- Add indexes and measure.
- If hot-path latency is still too high, denormalize that specific field.
- If this is reporting/analytics, prefer materialized aggregates over per-row denormalization.
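The first two steps of the flow above can be sketched in PostgreSQL syntax. The index name and the literal `post_id = 42` are illustrative assumptions, not part of the article's schema:

```sql
-- Index the column the aggregate filters on before reaching for denormalization.
CREATE INDEX IF NOT EXISTS idx_post_likes_post_id ON post_likes (post_id);

-- Measure: compare plan and timing before and after the index.
EXPLAIN ANALYZE
SELECT COUNT(*) FROM post_likes WHERE post_id = 42;
```

Only move to a denormalized field if the measured hot-path latency is still unacceptable after indexing.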
How to implement each approach
1) Query-time aggregate (default starting point)
```sql
SELECT post_id, COUNT(*) AS like_count
FROM post_likes
WHERE post_id = $1
GROUP BY post_id;
```
Simple and always correct, but it can get expensive on very hot endpoints.
2) Denormalized counter column (read-optimized)
```sql
ALTER TABLE posts ADD COLUMN like_count BIGINT NOT NULL DEFAULT 0;
```
On write (same transaction):

```sql
BEGIN;
INSERT INTO post_likes (post_id, user_id) VALUES ($1, $2);
UPDATE posts
SET like_count = like_count + 1
WHERE id = $1;
COMMIT;
```
Read path is now O(1):
```sql
SELECT id, title, like_count
FROM posts
WHERE id = $1;
```
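The unlike path is the mirror image. A sketch under the same schema, assuming a `(post_id, user_id)` unique constraint prevents double-likes:

```sql
BEGIN;
DELETE FROM post_likes WHERE post_id = $1 AND user_id = $2;
-- Only decrement if the DELETE actually removed a row;
-- check the affected-row count in application code before running this.
UPDATE posts SET like_count = like_count - 1 WHERE id = $1;
COMMIT;
```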
3) Materialized aggregate (reporting-optimized)
```sql
CREATE MATERIALIZED VIEW post_like_totals AS
SELECT post_id, COUNT(*) AS like_count
FROM post_likes
GROUP BY post_id;

-- REFRESH ... CONCURRENTLY requires a unique index on the view:
CREATE UNIQUE INDEX ON post_like_totals (post_id);
```

Refresh on a schedule:

```sql
REFRESH MATERIALIZED VIEW CONCURRENTLY post_like_totals;
```
Great for dashboards where slight staleness is acceptable.
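One way to run that refresh on a schedule is the pg_cron extension; this is an assumption, and any external job scheduler works just as well:

```sql
-- Assumes pg_cron is installed and enabled in the database.
SELECT cron.schedule(
  'refresh-post-like-totals',
  '*/5 * * * *',  -- every five minutes
  $$REFRESH MATERIALIZED VIEW CONCURRENTLY post_like_totals$$
);
```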
Good denormalization vs bad denormalization
Good denormalization:
- `posts.like_count`
- `users.post_count`
- `accounts.last_login_at`
These are derived values and can be repaired.
Bad denormalization:
- duplicating mutable truth like `customer_email` across many tables without clear snapshot semantics
If you need a historical copy, make intent explicit:
```sql
ALTER TABLE orders ADD COLUMN billing_email_snapshot TEXT;
-- Backfill existing rows first, then enforce the constraint:
ALTER TABLE orders ALTER COLUMN billing_email_snapshot SET NOT NULL;
```
Now it is intentionally point-in-time, not “live” profile data.
Safety rules for denormalized data
- Keep one canonical source table.
- Define acceptable staleness up front.
- Update denormalized fields in the same transaction when possible.
- Add reconciliation jobs to repair drift.
- Monitor mismatch rate so inconsistencies are visible.
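The reconciliation and monitoring rules above can be sketched as two queries against the `posts.like_count` counter from earlier:

```sql
-- Reconciliation job: overwrite any drifted counter from the canonical table.
UPDATE posts p
SET like_count = (SELECT COUNT(*) FROM post_likes l WHERE l.post_id = p.id)
WHERE p.like_count IS DISTINCT FROM
      (SELECT COUNT(*) FROM post_likes l WHERE l.post_id = p.id);

-- Monitoring: count rows whose counter disagrees with the source of truth.
SELECT COUNT(*) AS drifted_rows
FROM posts p
WHERE p.like_count IS DISTINCT FROM
      (SELECT COUNT(*) FROM post_likes l WHERE l.post_id = p.id);
```

Alert when `drifted_rows` exceeds your agreed staleness budget; the repair query is then safe to run at any time because the canonical table always wins.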
Denormalization is not a schema mistake. It is a performance tool. Use it after measuring, apply it narrowly, and make correctness guarantees explicit.