Migrating Terabytes of MongoDB Across Clouds with Minimal Disruption

Migrating terabytes of live MongoDB data across cloud providers is notoriously difficult. Standard approaches, such as extending the replica set across clouds, fail at this scale: initial syncs take weeks, oplogs roll over before the sync completes, and you end up stuck in an endless retry loop.

At Haptik, we needed to migrate a multi-terabyte MongoDB replica set from Cloud A to Cloud B with minimal disruption. Here's how we did it.

The Problem

Traditional MongoDB migrations rely on initial sync, but at terabyte scale:

  • Data transfer takes days or weeks
  • Oplogs roll over before sync completes
  • Continuous failed syncs drain resources
  • You can't afford extended downtime

Our Solution: VM Migration + Change Data Capture

We combined three techniques:

  • VM-based seeding: Migrating an existing secondary node
  • Change streams: Capturing ongoing changes in real-time
  • Delta replay: Small maintenance window to backfill missed changes

The Migration Process

Here's a high-level view of our setup.

[Figure: Mongo Migration]

Add a hidden secondary

We added a hidden secondary with priority: 0 and votes: 0 to the source replica set, so it could neither become primary nor vote in elections. This became our migration seed without affecting production.
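
For reference, the reconfiguration looks roughly like this with PyMongo (the hostnames below are hypothetical placeholders):

```python
# A minimal sketch of adding the hidden, non-voting member via PyMongo.
from pymongo import MongoClient

client = MongoClient("mongodb://primary.cloud-a.example:27017")  # hypothetical URI

# Fetch the current replica set config and bump its version.
config = client.admin.command("replSetGetConfig")["config"]
config["version"] += 1

# Append the seed node: hidden from drivers, never primary, never votes.
config["members"].append({
    "_id": max(m["_id"] for m in config["members"]) + 1,
    "host": "hidden-seed.cloud-a.example:27017",  # hypothetical seed host
    "priority": 0,
    "votes": 0,
    "hidden": True,
})

client.admin.command("replSetReconfig", config)
```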

Deploy delta store

We set up a temporary MongoDB instance (the delta store) and a Python script that used change streams to capture every insert, update, and delete on the source collections while the bulk migration happened.
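
A trimmed-down sketch of what the capture side can look like, with hypothetical URIs, database, and collection names (the real scripts were parameterized per collection):

```python
# A minimal sketch of the capture side: tail the source collection's change
# stream and keep the latest version of each changed document in the delta
# store (plus deleted ids and the resume token), so the delta can later be
# replayed into the target cluster. All names here are hypothetical.
from pymongo import MongoClient

source = MongoClient("mongodb://source.cloud-a.example:27017")  # hypothetical
delta = MongoClient("mongodb://delta-store.internal:27017")     # hypothetical

changed = delta["delta"]["orders"]           # latest state of changed documents
deleted = delta["delta"]["orders_deletes"]   # ids deleted on the source
tokens = delta["delta"]["resume_tokens"]     # resume token per collection

saved = tokens.find_one({"_id": "shop.orders"})
with source["shop"]["orders"].watch(
        resume_after=saved["token"] if saved else None,
        full_document="updateLookup") as stream:
    for change in stream:
        doc_id = change["documentKey"]["_id"]
        if change["operationType"] == "delete":
            deleted.replace_one({"_id": doc_id}, {"_id": doc_id}, upsert=True)
            changed.delete_one({"_id": doc_id})
        elif change.get("fullDocument") is not None:
            # insert/update/replace: store the post-image of the document
            changed.replace_one({"_id": doc_id}, change["fullDocument"], upsert=True)
        # Persist the resume token (and event time) so a restarted pod resumes cleanly.
        tokens.replace_one(
            {"_id": "shop.orders"},
            {"_id": "shop.orders", "token": change["_id"],
             "clusterTime": change["clusterTime"]},
            upsert=True)
```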

Critical monitoring

We configured alerts to ensure no data loss (a sketch of the lag check follows the list):

  • Change stream lag monitoring: scheduled scripts tracked the time difference between source writes and delta store captures
  • Resume token health checks: ensured tokens were being persisted correctly
  • Delta store disk space alerts: prevented storage exhaustion during high-write periods
  • Kubernetes liveness probes on the scripts: automatically restarted failed CDC pods
  • Missing operation detection: periodic reconciliation checks compared source and delta store document counts
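
For example, the lag check can be as simple as comparing the newest oplog entry on the source with the last event time persisted by the capture script; the sketch below reuses the hypothetical names from the capture example above:

```python
# A minimal sketch of the scheduled change stream lag check.
from pymongo import MongoClient

source = MongoClient("mongodb://source.cloud-a.example:27017")  # hypothetical
delta = MongoClient("mongodb://delta-store.internal:27017")     # hypothetical

# Newest write applied on the source, taken from the oplog.
last_oplog = source["local"]["oplog.rs"].find_one(sort=[("$natural", -1)])

# Last change event the capture script recorded for this collection.
token_doc = delta["delta"]["resume_tokens"].find_one({"_id": "shop.orders"})

lag = (last_oplog["ts"].as_datetime()
       - token_doc["clusterTime"].as_datetime()).total_seconds()
if lag > 60:  # assumed alert threshold in seconds
    print(f"ALERT: change stream lag for shop.orders is {lag:.0f}s")
```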

Migrate the seed node

We gracefully stopped the hidden secondary and used cloud VM migration services to clone the entire VM (with data) to Cloud B. This transferred terabytes at disk speeds, bypassing network bottlenecks.

Build target replica set

We brought up the migrated VM in Cloud B and used it to bootstrap a new replica set, giving us 99% of the data immediately.
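
Bootstrapping the new set from that node looks roughly like the sketch below (hypothetical hostnames and set name; it assumes the migrated node was restarted standalone and its old replica set metadata in the local database was cleared first):

```python
# A minimal sketch of initiating the Cloud B replica set from the migrated seed node.
from pymongo import MongoClient

# Connect directly to the migrated node, bypassing replica set discovery.
seed = MongoClient("mongodb://seed.cloud-b.example:27017", directConnection=True)

seed.admin.command("replSetInitiate", {
    "_id": "rs-cloud-b",  # hypothetical new replica set name
    "members": [{"_id": 0, "host": "seed.cloud-b.example:27017"}],
})
```

Additional Cloud B members can then be added as usual; their initial syncs stay within the same cloud, so they avoid the cross-cloud bottleneck entirely.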

Synchronize ongoing changes

Now came the critical synchronization phase:

  • Planned maintenance window: We scheduled a brief downtime window for the delta backfill
  • Stopped write traffic: Applications were put into read-only or maintenance mode
  • Stopped delta store streaming: Halted the CDC pipeline capturing changes
  • Backfilled delta changes: Used mongoexport on the delta store and mongoimport into the target cluster in Cloud B to replay all changes that occurred during VM migration
  • Started fresh change stream: Pointed a new CDC pipeline from source (Cloud A) to target (Cloud B)
  • Resumed traffic: Applications were brought back online, now writing to Cloud A while CDC kept Cloud B synchronized in real-time

This delta backfill phase required only a brief downtime window (typically minutes to a few hours, depending on the volume of changes), and it ensured no writes were lost during the VM migration.
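
The backfill itself can be scripted; the sketch below shows the sequential shape with hypothetical URIs and collection names (in practice we parallelized the exports and imports, and replaying captured deletes is not shown):

```python
# A minimal sketch of the backfill step, shelling out to mongoexport/mongoimport
# per collection. URIs, database, and collection names are hypothetical.
import subprocess

DELTA_URI = "mongodb://delta-store.internal:27017"      # hypothetical
TARGET_URI = "mongodb://target.cloud-b.example:27017"   # hypothetical
COLLECTIONS = ["orders", "users"]                       # hypothetical list

for coll in COLLECTIONS:
    dump_file = f"/tmp/{coll}.json"
    # Export the latest state of every document changed during the VM migration.
    subprocess.run(["mongoexport", "--uri", DELTA_URI,
                    "--db", "delta", "--collection", coll,
                    "--out", dump_file], check=True)
    # Upsert those documents into the Cloud B cluster.
    subprocess.run(["mongoimport", "--uri", TARGET_URI,
                    "--db", "shop", "--collection", coll,
                    "--mode", "upsert", "--file", dump_file], check=True)
```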

Alternative: Zero-downtime approach

If a maintenance window is absolutely not an option, you could keep both the delta store CDC and the direct Cloud A → Cloud B change stream running simultaneously, then reconcile any overlapping changes. However, this comes with significant risks:

  • Duplicate operations: The same change might be applied twice
  • Race conditions: Concurrent updates to the same document from different streams
  • Complex conflict resolution: Requires sophisticated logic to handle out-of-order or conflicting operations
  • Data consistency challenges: Harder to guarantee correctness

We chose the maintenance window approach for its simplicity and strong consistency guarantees.

Cutover

We gradually shifted application traffic to Cloud B, enabled reverse streaming as a safety net, validated thoroughly, and decommissioned the database in Cloud A.

The Change Stream Script

We built a Python CDC pipeline, deployed as one Kubernetes Deployment per collection:

Core logic (sketched after the list):

  • Check if target DB/collection/indexes exist → create if missing
  • On insert → insert into target
  • On update → upsert into target (handles out-of-order events)
  • On replace → replace in target
  • On delete → delete from target
  • Store resume tokens in MongoDB for crash recovery
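
A condensed sketch of that loop, using hypothetical environment variable names in place of the ConfigMap values and collapsing insert/update/replace into idempotent upserts of the post-image (index and collection creation from the first bullet are omitted):

```python
# A minimal sketch of the per-collection CDC apply loop.
import os
from pymongo import MongoClient

SOURCE_URI = os.environ["SOURCE_URI"]          # hypothetical ConfigMap key
TARGET_URI = os.environ["TARGET_URI"]          # hypothetical ConfigMap key
DB, COLL = os.environ["NAMESPACE"].split(".")  # e.g. "shop.orders" (hypothetical)

source = MongoClient(SOURCE_URI)[DB][COLL]
target_client = MongoClient(TARGET_URI)
target = target_client[DB][COLL]
tokens = target_client["cdc_meta"]["resume_tokens"]  # resume tokens kept in MongoDB

saved = tokens.find_one({"_id": f"{DB}.{COLL}"})
with source.watch(resume_after=saved["token"] if saved else None,
                  full_document="updateLookup") as stream:
    for change in stream:
        op = change["operationType"]
        key = change["documentKey"]
        if op == "delete":
            target.delete_one(key)
        elif op in ("insert", "update", "replace") and change.get("fullDocument"):
            # Upserting the post-image keeps the loop idempotent even if events
            # are replayed after a restart.
            target.replace_one(key, change["fullDocument"], upsert=True)
        # Persist the resume token after every applied event for crash recovery.
        tokens.replace_one({"_id": f"{DB}.{COLL}"},
                           {"_id": f"{DB}.{COLL}", "token": change["_id"]},
                           upsert=True)
```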

Deployment: Helm charts on K8s with ConfigMaps for MongoDB URIs and collection names.

Why this worked

 ✅ VM migration eliminated weeks of network transfer time
 ✅ Change streams captured every write during bulk migration
 ✅ Delta store ensured zero data loss
 ✅ Minimal downtime: only a brief window for the delta backfill
 ✅ Safe rollback: reverse streaming kept the old cluster available

Key challenges

  • Oplog sizing: ensuring the oplog wouldn't roll over during VM migration (see the sketch after this list)
  • Delta backfill optimization: parallel exports/imports, optimal batch sizes
  • Network bandwidth: CDC pipelines consumed significant bandwidth
  • Resume token persistence: storing tokens in MongoDB for reliability
  • Manual index replication: change streams don't capture DDL
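
For the oplog sizing check, the current oplog window is easy to measure up front; a minimal sketch with a hypothetical URI, equivalent to rs.printReplicationInfo() in the shell:

```python
# A minimal sketch of measuring the source oplog window: the gap between the
# oldest and newest oplog entries tells you how long a stopped member can be
# offline before it falls off the oplog.
from pymongo import MongoClient

oplog = MongoClient("mongodb://source.cloud-a.example:27017")["local"]["oplog.rs"]

first_ts = oplog.find_one(sort=[("$natural", 1)])["ts"].as_datetime()
last_ts = oplog.find_one(sort=[("$natural", -1)])["ts"].as_datetime()

window_hours = (last_ts - first_ts).total_seconds() / 3600
print(f"Current oplog window: {window_hours:.1f} hours")
```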

Lessons learned

  • Hidden secondaries are perfect for seeding large migrations
  • VM migration services save weeks of transfer time
  • Change streams are production-ready for CDC workloads
  • Always have a rollback plan: Reverse streaming gave us confidence
  • Test in staging first: Uncovered critical edge cases

The outcome

  • Migrated terabytes of MongoDB data
  • Minimal downtime: only brief maintenance window
  • Zero data loss
  • Improved performance and latency in Cloud B
  • Reduced infrastructure costs

When to Use This

This approach makes sense when:

  • You have terabyte-scale datasets
  • You can afford a brief maintenance window (minutes to hours)
  • You have access to VM migration services
  • You have Kubernetes for CDC deployments

For datasets under 100GB, standard replica set extension is simpler.

At terabyte scale, textbook approaches don't work. Sometimes you need to write your own playbook.

 
