Migrating Terabytes of MongoDB Across Clouds with Minimal Disruption

Migrating terabytes of live MongoDB data across cloud providers is notoriously difficult. Standard approaches, like extending a replica set across clouds, fail at scale: initial syncs take weeks, oplogs roll over before the sync completes, and you end up stuck in an endless retry loop.
At Haptik, we needed to migrate a multi-terabyte MongoDB replica set from Cloud A to Cloud B with minimal disruption. Here's how we did it.
The Problem
Traditional MongoDB migrations rely on initial sync, but at terabyte scale:
- Data transfer takes days or weeks
- Oplogs roll over before sync completes
- Continuous failed syncs drain resources
- You can't afford extended downtime
Our Solution: VM Migration + Change Data Capture
We combined three techniques:
- VM-based seeding: Migrating an existing secondary node
- Change streams: Capturing ongoing changes in real-time
- Delta replay: Small maintenance window to backfill missed changes
The Migration Process
Here's a high-level view of our setup.
Add a hidden secondary
We added a hidden, non-electing secondary (priority: 0, votes: 0) to the source replica set. This became our migration seed without affecting production.
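With mongosh this is a single rs.add() call; for illustration, a minimal pymongo equivalent might look like the sketch below (the URIs, host names, and replica set name are hypothetical):

```python
from pymongo import MongoClient

# Hypothetical URI and host names; run this against the source primary.
client = MongoClient("mongodb://source-primary.clouda:27017/?replicaSet=rs-clouda")

config = client.admin.command("replSetGetConfig")["config"]
config["version"] += 1
config["members"].append({
    "_id": max(m["_id"] for m in config["members"]) + 1,
    "host": "migration-seed.clouda:27017",  # the node that will later be cloned
    "hidden": True,     # invisible to application reads
    "priority": 0,      # can never be elected primary
    "votes": 0,         # does not participate in elections
})
client.admin.command("replSetReconfig", config)
```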
Deploy delta store
We set up a temporary MongoDB instance with a Python script using change streams to capture every insert/update/delete from source collections while bulk migration happened.
Critical monitoring
We configured alerts and checks to make sure no data was lost:
- Change stream lag monitoring: scheduled scripts tracked the time difference between source writes and delta store captures (see the sketch below)
- Resume token health checks: ensured tokens were being persisted correctly
- Delta store disk space alerts: prevented storage exhaustion during high-write periods
- Kubernetes liveness probes: automatically restarted failed CDC script pods
- Missing-operation detection: periodic reconciliation checks compared source and delta store document counts
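As a rough illustration of the lag check, here is a minimal sketch, assuming the CDC script stores each change event with its clusterTime in a delta collection (connection strings, collection names, and the threshold are hypothetical):

```python
from pymongo import MongoClient

# Hypothetical connection strings and collection names.
source = MongoClient("mongodb://source-primary.clouda:27017/?replicaSet=rs-clouda")
delta = MongoClient("mongodb://delta-store:27017")

LAG_THRESHOLD_SECONDS = 60  # alert past this point; tune for your workload

def change_stream_lag_seconds() -> float:
    """Approximate seconds between the newest source write and the newest
    change event persisted in the delta store."""
    last_op = source.local["oplog.rs"].find_one(sort=[("ts", -1)])
    last_captured = delta["cdc"]["events"].find_one(sort=[("clusterTime", -1)])
    if last_op is None or last_captured is None:
        return float("inf")
    # bson Timestamps expose their epoch seconds via .time
    return last_op["ts"].time - last_captured["clusterTime"].time

if __name__ == "__main__":
    lag = change_stream_lag_seconds()
    if lag > LAG_THRESHOLD_SECONDS:
        print(f"ALERT: change stream lag is {lag:.0f}s")
```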
Migrate the seed node
We gracefully stopped the hidden secondary and used cloud VM migration services to clone the entire VM (with data) to Cloud B. This transferred terabytes at disk speeds, bypassing network bottlenecks.
Build target replica set
We brought up the migrated VM in Cloud B and used it to bootstrap a new replica set, giving us 99% of the data immediately.
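The exact steps depend on how the node is brought up; one common way to re-seed a new replica set from cloned data files (following MongoDB's restore-from-backup procedure, with hypothetical host and set names) is to start the migrated mongod standalone, clear the old replica set metadata, then restart it with the new set name and initiate:

```python
from pymongo import MongoClient

# Step 1: with the migrated mongod started standalone (no --replSet flag),
# drop the old Cloud A replica set metadata stored in the local database.
standalone = MongoClient("mongodb://cloudb-seed:27017", directConnection=True)
standalone.drop_database("local")
standalone.close()

# Step 2: after restarting mongod with --replSet rs-cloudb, initiate the
# new replica set with the migrated node as its first member.
seed = MongoClient("mongodb://cloudb-seed:27017", directConnection=True)
seed.admin.command("replSetInitiate", {
    "_id": "rs-cloudb",
    "members": [{"_id": 0, "host": "cloudb-seed:27017"}],
})
```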
Synchronize ongoing changes
Now came the critical synchronization phase:
- Planned maintenance window: We scheduled a brief downtime window for the delta backfill
- Stopped write traffic: Applications were put into read-only or maintenance mode
- Stopped delta store streaming: Halted the CDC pipeline capturing changes
- Backfilled delta changes: Used mongoexport on the delta store and mongoimport into the target cluster in Cloud B to replay all changes that occurred during VM migration
- Started fresh change stream: Pointed a new CDC pipeline from source (Cloud A) to target (Cloud B)
- Resumed traffic: Applications were brought back online, now writing to Cloud A while CDC kept Cloud B synchronized in real time
The delta backfill required only a brief downtime window (typically minutes to a few hours, depending on the volume of changes) and ensured no writes were lost during the VM migration.
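We used mongoexport and mongoimport for the backfill itself; conceptually, though, the replay boils down to applying each captured event to the target in capture order. A minimal pymongo sketch of that idea (the URIs, database name, and delta store schema are assumptions):

```python
from pymongo import MongoClient, ReplaceOne, DeleteOne

delta = MongoClient("mongodb://delta-store:27017")                     # hypothetical
target = MongoClient("mongodb://cloudb-seed:27017/?replicaSet=rs-cloudb")

def replay(collection: str, batch_size: int = 1000) -> None:
    """Replay captured change events into the target cluster, in order."""
    ops, coll = [], target["appdb"][collection]
    # Events were stored in capture order; replay them the same way.
    for event in delta["cdc"][collection].find(sort=[("clusterTime", 1)]):
        key = event["documentKey"]
        if event["operationType"] == "delete":
            ops.append(DeleteOne(key))
        else:
            # insert/update/replace: fullDocument was captured with
            # full_document="updateLookup", so a replace-upsert is idempotent.
            ops.append(ReplaceOne(key, event["fullDocument"], upsert=True))
        if len(ops) >= batch_size:
            coll.bulk_write(ops, ordered=True)
            ops = []
    if ops:
        coll.bulk_write(ops, ordered=True)
```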
Alternative: Zero-downtime approach
If a maintenance window is absolutely not an option, you could keep both the delta store CDC and the direct Cloud A → Cloud B change stream running simultaneously, then reconcile any overlapping changes. However, this comes with significant risks:
- Duplicate operations: The same change might be applied twice
- Race conditions: Concurrent updates to the same document from different streams
- Complex conflict resolution: Requires sophisticated logic to handle out-of-order or conflicting operations
- Data consistency challenges: Harder to guarantee correctness
We chose the maintenance window approach for its simplicity and strong consistency guarantees.
Cutover
We gradually shifted application traffic to Cloud B, enabled reverse streaming as a safety net, validated thoroughly, and decommissioned the database in Cloud A.
The Change Stream Script
We built a Python CDC pipeline deployed as Kubernetes deployments (one per collection):
Deployment: Helm charts on K8s, with ConfigMaps providing the MongoDB URIs and collection names.
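The script itself isn't reproduced here, but a minimal sketch of such a consumer, assuming the URIs and names arrive via ConfigMap-backed environment variables (the variable names, database names, and error handling below are illustrative, not our exact implementation):

```python
import os
import time
from pymongo import MongoClient
from pymongo.errors import PyMongoError

# Connection details come from the ConfigMap via environment variables
# (variable names here are illustrative).
SOURCE_URI = os.environ["SOURCE_MONGO_URI"]
DELTA_URI = os.environ["DELTA_MONGO_URI"]
DB_NAME = os.environ["DB_NAME"]
COLLECTION = os.environ["COLLECTION_NAME"]

source = MongoClient(SOURCE_URI)
delta = MongoClient(DELTA_URI)

events = delta["cdc"][COLLECTION]              # captured change events
tokens = delta["cdc_meta"]["resume_tokens"]    # persisted resume tokens


def load_resume_token():
    doc = tokens.find_one({"_id": COLLECTION})
    return doc["token"] if doc else None


def run():
    pipeline = [{"$match": {"operationType": {"$in": ["insert", "update", "replace", "delete"]}}}]
    # full_document="updateLookup" attaches the post-image to update events,
    # so every captured event can later be replayed as an idempotent upsert.
    with source[DB_NAME][COLLECTION].watch(
        pipeline,
        full_document="updateLookup",
        resume_after=load_resume_token(),
    ) as stream:
        for change in stream:
            events.insert_one({
                "operationType": change["operationType"],
                "documentKey": change["documentKey"],
                "fullDocument": change.get("fullDocument"),
                "clusterTime": change["clusterTime"],
            })
            # Persist the resume token after every event so a restarted pod
            # resumes exactly where it left off.
            tokens.replace_one(
                {"_id": COLLECTION},
                {"_id": COLLECTION, "token": change["_id"]},
                upsert=True,
            )


if __name__ == "__main__":
    while True:
        try:
            run()
        except PyMongoError:
            # Transient errors: back off and resume from the stored token.
            # Persistent failures surface via the liveness probe instead.
            time.sleep(5)
```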
Why this worked
✅ VM migration eliminated weeks of network transfer time
✅ Change streams captured every write during bulk migration
✅ Delta store ensured zero data loss
✅ Minimal downtime: only a brief window for the delta backfill
✅ Safe rollback: reverse streaming kept the old cluster available
Key challenges
- Oplog sizing: ensuring the oplog wouldn't roll over during the VM migration
- Delta backfill optimization: parallel exports/imports, optimal batch sizes
- Network bandwidth: CDC pipelines consumed significant bandwidth
- Resume token persistence: storing tokens in MongoDB for reliability
- Manual index replication: change streams don't capture DDL
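Because change streams don't carry DDL, indexes had to be copied over separately. A minimal sketch of that step (the connection strings are hypothetical):

```python
from pymongo import MongoClient

# Hypothetical connection strings for the source and target clusters.
source = MongoClient("mongodb://source-primary.clouda:27017/?replicaSet=rs-clouda")
target = MongoClient("mongodb://cloudb-seed:27017/?replicaSet=rs-cloudb")

def copy_indexes(db_name: str, collection: str) -> None:
    """Recreate the source collection's indexes on the target cluster."""
    for index in source[db_name][collection].list_indexes():
        if index["name"] == "_id_":
            continue  # the default _id index already exists on the target
        keys = list(index["key"].items())
        # Carry over index options (unique, sparse, TTL, ...) but drop
        # bookkeeping fields that create_index does not accept.
        options = {k: v for k, v in index.items() if k not in ("key", "v", "ns")}
        target[db_name][collection].create_index(keys, **options)
```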
Lessons learned
- Hidden secondaries are perfect for seeding large migrations
- VM migration services save weeks of transfer time
- Change streams are production-ready for CDC workloads
- Always have a rollback plan: Reverse streaming gave us confidence
- Test in staging first: Uncovered critical edge cases
The outcome
- Migrated terabytes of MongoDB data
- Minimal downtime: only brief maintenance window
- Zero data loss
- Improved performance and latency in Cloud B
- Reduced infrastructure costs
When to Use This
This approach makes sense when:
- You have terabyte-scale datasets
- You can afford a brief maintenance window (minutes to hours)
- You have access to VM migration services
- You have Kubernetes for CDC deployments
For datasets under 100GB, standard replica set extension is simpler.
At terabyte scale, textbook approaches don't work. Sometimes you need to write your own playbook.