Note: The domain, table names, and data model shown in this post have been changed to protect the product’s IP. The migration patterns and lessons are real.
We migrated a production app from Next.js + Drizzle + Postgres (on Neon) to Convex. Not a greenfield rewrite—a live cutover across roughly 20 tables, some with more than 10 million rows and deep foreign-key dependencies between them.
I’m keeping the product anonymous, but context matters: this was a collaborative GenAI canvas with over 500k users at the time of migration. Realtime wasn’t a nice-to-have—it was the product.
This is the story of what actually worked, what broke, and what I’d do differently next time.
Why we moved
Our old stack did its job, but it fought us on the things that mattered most. We had collaboration features—shared spending pools, live canvas state, multi-user editing—that all relied on polling loops and custom sync code. Every new realtime feature meant another layer of glue: server routes to write data, polling intervals to detect changes, client-side reconciliation to keep things consistent. The async job story was similar—background work ran through polling-based workers that added operational complexity without adding product value.
Convex promised a tighter model: server-side functions with native subscriptions where mutations automatically push updates to every connected client. No polling. No manual cache invalidation. For a collaboration-heavy product, that model was exactly right.
The shape of the problem
About 20 tables in Postgres, some with more than 10 million rows. Strong FK dependencies between entities. Live traffic from 500k+ users. Target: migrate everything with less than a few hours of downtime and zero data loss.
And a nice surprise we discovered along the way: significant amounts of broken relational data already living in Postgres that nobody knew about.
Target architecture
We kept the Convex data model close to our Postgres schema. The critical addition was a pg_id field on every migrated entity—a bridge column that let us trace documents back to their Postgres origin.
That pg_id did three things for us. It let us find existing Convex documents by legacy Postgres identity, which made upserts deterministic. It gave us a key for reconstructing relationships—when migrating a product that references a store, we could look up the store’s Convex _id through its pg_id. And it gave us traceability during debugging, which we needed more than we expected.
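Concretely, the bridge field was just an indexed column in the Convex schema. A minimal sketch (the table and field names here are illustrative, not our real schema):

```typescript
// convex/schema.ts — a sketch; "stores" and its fields are illustrative
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  stores: defineTable({
    pg_id: v.string(), // legacy Postgres identity, the bridge column
    name: v.string(),
    // ...the rest of the migrated columns
  }).index("by_pg_id", ["pg_id"]),
});
```

The `by_pg_id` index is what makes the upsert lookups later in this post cheap.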
Because Convex generates its own document IDs, we couldn’t just dump tables and preserve references. We had to migrate entities in dependency order, resolve references through pg_id, and let Convex generate canonical IDs. This is where most of the real complexity lived.
Clerk stayed unchanged for authentication. The only architectural shift was that Convex became the canonical data layer, with Convex IDs as runtime references.
What we built
The migration system had two halves. On the Next.js side, an orchestrator walked through tables in dependency order and fanned out parallel workers per table. On the Convex side, bulk import mutations received batches and performed idempotent upserts.
flowchart LR
subgraph Next.js
O[Orchestrator] --> W1[Worker 1]
O --> W2[Worker 2]
O --> W3[Worker N]
end
subgraph Postgres
W1 -->|paginate| PG[(Unmigrated rows)]
W2 -->|paginate| PG
W3 -->|paginate| PG
end
subgraph Convex
W1 -->|batch| M[Bulk import mutation]
W2 -->|batch| M
W3 -->|batch| M
M -->|query by_pg_id| D[(Convex DB)]
M -->|patch or insert| D
end
M -->|migrated IDs| W1
W1 -->|status writeback| PG
The dependency ordering was explicit—parents before children—because a product can’t resolve its storeId foreign key until the store document exists in Convex. The orchestrator processed tables sequentially in a fixed order:
const tableProcessors = [
{ name: "Stores", fn: processStores },
{ name: "Customers", fn: processCustomers },
{ name: "Subscriptions", fn: processSubscriptions },
{ name: "Store Memberships", fn: processStoreMemberships },
{ name: "Categories", fn: processCategories },
{ name: "Products", fn: processProducts },
{ name: "Orders", fn: processOrders },
{ name: "Order History", fn: processOrderHistory },
// ...~20 tables total
];
for (const table of tableProcessors) {
const result = await table.fn();
results[table.name] = result;
}
The table dependency graph looked roughly like this—each table had to wait for its parents to converge first:
flowchart TD
Stores --> Customers
Stores --> Subscriptions
Stores --> StoreMemberships[Store Memberships]
Customers --> StoreMemberships
Customers --> Categories
Stores --> Categories
Categories --> Products
Customers --> Products
Stores --> Products
Products --> Orders
Customers --> Orders
Stores --> Orders
Products --> OrderHistory[Order History]
Customers --> OrderHistory
Orders --> OrderHistory
Within each table, the orchestrator counted unmigrated rows, calculated how many parallel workers to spin up, and each worker got an offset range to process. Workers paginated through Postgres, batched rows, and sent them to Convex.
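The partitioning itself was simple arithmetic. A self-contained sketch (the per-worker row budget and worker cap are illustrative numbers, not our production values):

```typescript
// Split a table's unmigrated rows into offset ranges, one per worker.
// rowsPerWorker and maxWorkers are illustrative defaults.
function planWorkerRanges(
  totalRows: number,
  rowsPerWorker = 50_000,
  maxWorkers = 8
) {
  const workerCount = Math.min(
    maxWorkers,
    Math.max(1, Math.ceil(totalRows / rowsPerWorker))
  );
  const chunk = Math.ceil(totalRows / workerCount);
  // Each worker paginates its own [offset, offset + limit) slice.
  return Array.from({ length: workerCount }, (_, i) => ({
    offset: i * chunk,
    limit: Math.min(chunk, totalRows - i * chunk),
  }));
}
```

For example, 120,000 unmigrated rows with these defaults would fan out to three workers at offsets 0, 40,000, and 80,000.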
The idempotent upsert
Every Convex bulk import mutation followed the same contract: look up by pg_id, patch if found, insert if not. This made every migration call safe to retry.
async function importStore(
ctx: MutationCtx,
input: StoreImportData
) {
const { id: pg_id, ownerEmail, ...rest } = input;
try {
const existing = await ctx.db
.query("stores")
.withIndex("by_pg_id", (q) => q.eq("pg_id", pg_id))
.first();
if (existing) {
await ctx.db.patch(existing._id, { ...rest });
return pg_id;
}
await ctx.db.insert("stores", { ...rest, pg_id });
return pg_id;
} catch (error) {
await ctx.db.insert("migrationFailures", {
pg_id,
tableName: "stores",
error: error instanceof Error
? error.message
: "Unknown error",
});
return null;
}
}
For entities with foreign keys, the pattern extended to resolve dependencies first. Here’s a simplified version of our orders import, which needed to resolve three parent references—customer, store, and product—before it could upsert:
async function importOrder(
ctx: MutationCtx,
input: OrderImportData
) {
const {
id: pg_id,
customerEmail,
storeId: pg_storeId,
productId: pg_productId,
...rest
} = input;
// Resolve all FK references through indexes
const customer = await ctx.db
.query("customers")
.withIndex("by_email", (q) =>
q.eq("email", customerEmail)
)
.first();
// A missing customer is a required relationship we can't resolve;
// treat the row as invalid source data and skip it.
if (!customer) return null;
const store = pg_storeId
? await ctx.db
.query("stores")
.withIndex("by_pg_id", (q) =>
q.eq("pg_id", pg_storeId)
)
.first()
: undefined;
const product = pg_productId
? await ctx.db
.query("products")
.withIndex("by_pg_id", (q) =>
q.eq("pg_id", pg_productId)
)
.first()
: undefined;
// Same upsert-by-pg_id pattern
const existing = await ctx.db
.query("orders")
.withIndex("by_pg_id", (q) => q.eq("pg_id", pg_id))
.first();
const data = {
customerId: customer._id,
storeId: store?._id,
productId: product?._id,
...rest,
};
if (existing) {
await ctx.db.patch(existing._id, data);
} else {
await ctx.db.insert("orders", { pg_id, ...data });
}
return pg_id;
}
If I had to pick one design principle that carried the entire migration, it would be this idempotent upsert contract. Without it, our daily sync strategy—which I’ll explain next—would have created duplicates and corrupted data.
Roadblock: bad source data disguised as migration bugs
Early test runs looked broken. Relationship failures everywhere. We spent days debugging our migration logic before realizing the real issue: a significant amount of broken relational data already existed in Postgres. Foreign keys pointing at deleted rows, orphaned records, references to entities that never existed.
Convex made these problems visible because FK remapping was explicit in our code path. In Postgres, these broken references sat quietly. In our migration code, they threw errors.
Once we understood this, we changed our debugging posture. Instead of asking “why did migration fail?”, we started with triage: is this a migration logic bug, a source data integrity issue, or expected data drift from live writes? Each failure class has a different fix. We added a migrationFailures table in Convex that recorded errors by table and reason, which replaced hours of log reading with structured queries.
The strategic change was simple: treat unresolved required relationships as invalid source data and skip those rows rather than forcing partial corruption into Convex.
Roadblock: the first dry run took 30 hours
Our first full migration test on production-scale data took over 30 hours. That was incompatible with any reasonable downtime window.
So we switched from a one-shot migration to incremental convergence. The idea: track what’s been migrated, only process what hasn’t, and let the system converge over time.
We added a migratedToConvex column to every table in Postgres—a state machine with four states:
export enum MigrationStatus {
NOT_MIGRATED = 0,
MIGRATION_ERROR = 1,
SUCCESSFULLY_MIGRATED = 2,
SUCCESSFULLY_MIGRATED_AFTER_ERROR = 3,
}
Here’s the key trick—we used Drizzle’s $onUpdate hook to automatically reset migration status whenever a row was modified through normal app usage:
export const stores = pgTable("stores", {
// ...columns...
migratedToConvex: smallint("migrated_to_convex")
.default(MigrationStatus.NOT_MIGRATED)
.$type<MigrationStatus>()
.$onUpdate(() => MigrationStatus.NOT_MIGRATED),
});
Any production write reset the row to NOT_MIGRATED, so the next sync pass would automatically re-sync it to Convex. After migration calls returned, the Next.js side wrote status back—marking each row as either successfully migrated or errored:
export async function processMigrationResult<TId>(
table: any,
attemptedIds: TId[],
successfulIds: TId[],
) {
const successSet = new Set(successfulIds);
const failedIds = attemptedIds.filter(
(id) => !successSet.has(id)
);
await Promise.all([
updateMigratedEntities(table, successfulIds),
markEntitiesAsError(table, failedIds),
]);
}
We ran the initial backfill (which took about 30 hours), then ran daily sync passes that only processed unmigrated or errored rows. Each pass moved rows through the state machine. By cutover day, the remaining delta was small enough to fit within a few-hours window.
stateDiagram-v2
[*] --> NOT_MIGRATED
NOT_MIGRATED --> SUCCESSFULLY_MIGRATED: sync pass succeeds
NOT_MIGRATED --> MIGRATION_ERROR: sync pass fails
MIGRATION_ERROR --> SUCCESSFULLY_MIGRATED_AFTER_ERROR: retry succeeds
SUCCESSFULLY_MIGRATED --> NOT_MIGRATED: app writes to row
SUCCESSFULLY_MIGRATED_AFTER_ERROR --> NOT_MIGRATED: app writes to row
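The transitions are small enough to express as a pure function. A sketch (not the production code; the enum is repeated here so the snippet stands alone):

```typescript
enum MigrationStatus {
  NOT_MIGRATED = 0,
  MIGRATION_ERROR = 1,
  SUCCESSFULLY_MIGRATED = 2,
  SUCCESSFULLY_MIGRATED_AFTER_ERROR = 3,
}

// Map a sync-pass result onto the state machine above.
function nextStatus(
  current: MigrationStatus,
  syncSucceeded: boolean
): MigrationStatus {
  // Any failed pass lands on MIGRATION_ERROR.
  if (!syncSucceeded) return MigrationStatus.MIGRATION_ERROR;
  // A success that follows an error gets its own state, so rows that
  // ever failed stay distinguishable from rows that migrated cleanly.
  return current === MigrationStatus.MIGRATION_ERROR
    ? MigrationStatus.SUCCESSFULLY_MIGRATED_AFTER_ERROR
    : MigrationStatus.SUCCESSFULLY_MIGRATED;
}
```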
This was the turning point. Once we accepted that a single-shot migration wouldn’t meet our downtime target, the problem became manageable. Migration went from a scary one-shot event to a boring daily background process with a steadily shrinking delta.
Throughput: what Convex can actually ingest
We discovered through trial and error that Convex write throughput during migration was lower than expected. The safe ceiling landed around 1,000 records per call for simple tables, but some tables needed significantly smaller batches. Batch sizes ended up tuned per table:
export const STORES_BATCH_SIZE = 1000;
export const ORDERS_BATCH_SIZE = 500;
export const PRODUCTS_BATCH_SIZE = 150;
export const CUSTOMERS_BATCH_SIZE = 40;
export const ORDER_HISTORY_BATCH_SIZE = 100;
Customers at 40 records per batch looks surprising until you understand why: each customer insert triggers multiple index updates, and index builds dominate Convex write throughput during heavy imports. The Convex team confirmed this and helped us tune our import behavior.
We also had to deal with Convex’s request body limit. Some of our records—like order history with large JSON blobs—could blow past it. We added a binary search that found the maximum number of documents that actually fit in a single request:
function doesFitInConvexRequest<T>(docs: T[]) {
return JSON.stringify(docs).length < 2_000_000;
}
export function findMaxDocsForRequest<T>(docs: T[]) {
if (doesFitInConvexRequest(docs)) return docs.length;
let left = 1;
let right = docs.length;
let maxFit = 0;
while (left <= right) {
const mid = Math.floor((left + right) / 2);
if (doesFitInConvexRequest(docs.slice(0, mid))) {
maxFit = mid;
left = mid + 1;
} else {
right = mid - 1;
}
}
return maxFit;
}
The lesson: don’t assume one global batch size is good enough. Tune per table and per workload profile.
Cutover day
We did not dual-write. We backfilled, ran daily syncs for about two weeks, then took a few-hours downtime window for the final sync and traffic switch.
gantt
title Migration timeline
dateFormat X
axisFormat %s
section Backfill
Initial full migration (~30h) :done, 0, 30
section Daily sync
Sync pass 1 :done, 31, 32
Sync pass 2 :done, 33, 34
Sync passes 3-14 :done, 35, 46
section Cutover
Freeze writes :crit, 47, 48
Final sync :crit, 48, 50
Validate and switch traffic :crit, 50, 51
Live on Convex :active, 51, 55
After cutover, we stopped writing to Postgres entirely. We didn’t build reverse replication from Convex back to Postgres, so rollback was not realistic.
The runbook was roughly: freeze writes, run final sync over remaining unmigrated rows, validate critical entity counts and relationship integrity, switch read/write paths to Convex, monitor error rates and subscription behavior.
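The count-validation step can be as simple as diffing per-table counts. A sketch (illustrative, not our actual runbook script):

```typescript
// Return the tables whose Postgres row count doesn't match the
// Convex document count; an empty result means counts line up.
function validateCounts(
  pgCounts: Record<string, number>,
  convexCounts: Record<string, number>
): string[] {
  return Object.keys(pgCounts).filter(
    (table) => pgCounts[table] !== (convexCounts[table] ?? 0)
  );
}
```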
Everything held. But because rollback was weak, the emotional profile of that window was “high confidence in prep, combined with a lot of hope.” This was one of the biggest strategic gaps in our plan.
Results and operational reality
The outcome was positive. Performance improved across our core paths—Convex actions, queries, and mutations are significantly faster than what we had with Next.js server routes hitting Postgres. Realtime collaboration became a first-class citizen instead of a polling hack. Cost dropped by roughly 50% compared to our previous Vercel bill.
Developer experience is honestly mixed but improving. We weren’t instantly expert Convex developers, and some early production incidents were self-inflicted.
The biggest example: we created query patterns with “global” values that many users subscribed to simultaneously. When we deployed, Convex revalidated all active queries, and every subscriber reconnected at once—a textbook thundering herd. Every push to prod effectively caused downtime.
The Convex team was genuinely helpful here. They gave us more infrastructure to absorb the load and helped us understand the root cause. The fix on our side was narrowing subscription scope and getting strict about index-first query design. In a collaborative product where thousands of users are connected simultaneously, a poorly scoped subscription can amplify load faster than you’d expect.
The filter-vs-index distinction in Convex is high leverage and easy to get wrong. If you filter client-side or use .filter() instead of .withIndex(), you’ll pay for it at scale. We learned that the hard way.
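To make the distinction concrete, here is roughly what the two query shapes look like (a sketch assuming Convex's documented query API; ctx is typed loosely and the by_store index name is illustrative):

```typescript
// Scans the whole table, then filters: cost grows with table size.
async function ordersForStoreSlow(ctx: any, storeId: string) {
  return await ctx.db
    .query("orders")
    .filter((q: any) => q.eq(q.field("storeId"), storeId))
    .collect();
}

// Reads only the matching index range: cost grows with result size.
async function ordersForStoreFast(ctx: any, storeId: string) {
  return await ctx.db
    .query("orders")
    .withIndex("by_store", (q: any) => q.eq("storeId", storeId))
    .collect();
}
```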
Mistakes we made
- We skipped dual-write and gave ourselves no rollback path. Everything worked, but we were one bad bug away from a very painful situation.
- We treated source data quality issues as migration bugs for too long. Broken FK relationships in Postgres surfaced as migration failures. Days of debugging before we realized the data was already broken.
- We introduced subscription patterns that amplified load during deployments. Global query shapes + redeployment revalidation = thundering herd.
Would I do it again?
For this class of product—yes, without hesitation. If I were building another startup collaboration app with strong realtime requirements, I’d choose Convex again. The model fits the problem space well, and the performance and cost profile was compelling.
But Convex isn’t a universal answer. I wouldn’t use it as my primary system for heavy search or aggregation workloads—push those into specialized systems. I’d be cautious in enterprise contexts that need mature compliance and procurement tooling from day one. And I’d be very deliberate about schema and index design upfront, because how you structure your indexes in Convex has outsized impact on both performance and operational stability.
The biggest lesson from this migration: success depended less on copying rows and more on idempotency, relationship ordering, and operational discipline. If I were writing the v2 playbook, it would start with a dual-write path, a tested rollback plan, and a pre-migration data quality audit—before a single row moves.