Migrating Terabytes of MongoDB Across Clouds with Minimal Disruption

Migrating terabytes of live MongoDB data across cloud providers is notoriously difficult. Standard approaches, such as extending the replica set across clouds, fail at this scale: initial syncs take weeks, oplogs roll over before the sync completes, and you end up stuck in an endless retry loop.

At Haptik, we needed to migrate a multi-terabyte MongoDB replica set from Cloud A to Cloud B with minimal disruption. Here's how we did it.

The Problem

Traditional MongoDB migrations rely on initial sync, but at terabyte scale:

  • Data transfer takes days or weeks
  • Oplogs roll over before sync completes
  • Continuous failed syncs drain resources
  • You can't afford extended downtime

Our Solution: VM Migration + Change Data Capture

We combined three techniques:

  • VM-based seeding: Migrating an existing secondary node
  • Change streams: Capturing ongoing changes in real-time
  • Delta replay: Small maintenance window to backfill missed changes

The Migration Process

Here's a high-level view of our setup.

[Figure: Mongo Migration]

Add a hidden secondary

We added a hidden secondary with priority: 0 and votes: 0 to the source replica set, so it could neither become primary nor vote in elections. This became our migration seed without affecting production.
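
For reference, the reconfiguration looks roughly like this with PyMongo (the hostnames below are hypothetical placeholders):

```python
# A minimal sketch of adding the hidden, non-voting member via PyMongo.
from pymongo import MongoClient

client = MongoClient("mongodb://primary.cloud-a.example:27017")  # hypothetical URI

# Fetch the current replica set config and bump its version.
config = client.admin.command("replSetGetConfig")["config"]
config["version"] += 1

# Append the seed node: hidden from drivers, never primary, never votes.
config["members"].append({
    "_id": max(m["_id"] for m in config["members"]) + 1,
    "host": "hidden-seed.cloud-a.example:27017",  # hypothetical seed host
    "priority": 0,
    "votes": 0,
    "hidden": True,
})

client.admin.command("replSetReconfig", config)
```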

Deploy delta store

We set up a temporary MongoDB instance (the delta store) and a Python script that used change streams to capture every insert, update, and delete on the source collections while the bulk migration happened.
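
A trimmed-down sketch of what the capture side can look like, with hypothetical URIs, database, and collection names (the real scripts were parameterized per collection):

```python
# A minimal sketch of the capture side: tail the source collection's change
# stream and keep the latest version of each changed document in the delta
# store (plus deleted ids and the resume token), so the delta can later be
# replayed into the target cluster. All names here are hypothetical.
from pymongo import MongoClient

source = MongoClient("mongodb://source.cloud-a.example:27017")  # hypothetical
delta = MongoClient("mongodb://delta-store.internal:27017")     # hypothetical

changed = delta["delta"]["orders"]           # latest state of changed documents
deleted = delta["delta"]["orders_deletes"]   # ids deleted on the source
tokens = delta["delta"]["resume_tokens"]     # resume token per collection

saved = tokens.find_one({"_id": "shop.orders"})
with source["shop"]["orders"].watch(
        resume_after=saved["token"] if saved else None,
        full_document="updateLookup") as stream:
    for change in stream:
        doc_id = change["documentKey"]["_id"]
        if change["operationType"] == "delete":
            deleted.replace_one({"_id": doc_id}, {"_id": doc_id}, upsert=True)
            changed.delete_one({"_id": doc_id})
        elif change.get("fullDocument") is not None:
            # insert/update/replace: store the post-image of the document
            changed.replace_one({"_id": doc_id}, change["fullDocument"], upsert=True)
        # Persist the resume token (and event time) so a restarted pod resumes cleanly.
        tokens.replace_one(
            {"_id": "shop.orders"},
            {"_id": "shop.orders", "token": change["_id"],
             "clusterTime": change["clusterTime"]},
            upsert=True)
```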

Critical monitoring

We configured alerts to ensure no data loss (a sketch of the lag check follows the list):

  • Change stream lag monitoring: scheduled scripts tracked the time difference between source writes and delta store captures
  • Resume token health checks: ensured tokens were being persisted correctly
  • Delta store disk space alerts: prevented storage exhaustion during high-write periods
  • Kubernetes liveness probes on the scripts: automatically restarted failed CDC pods
  • Missing operation detection: periodic reconciliation checks compared source and delta store document counts
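
For example, the lag check can be as simple as comparing the newest oplog entry on the source with the last event time persisted by the capture script; the sketch below reuses the hypothetical names from the capture example above:

```python
# A minimal sketch of the scheduled change stream lag check.
from pymongo import MongoClient

source = MongoClient("mongodb://source.cloud-a.example:27017")  # hypothetical
delta = MongoClient("mongodb://delta-store.internal:27017")     # hypothetical

# Newest write applied on the source, taken from the oplog.
last_oplog = source["local"]["oplog.rs"].find_one(sort=[("$natural", -1)])

# Last change event the capture script recorded for this collection.
token_doc = delta["delta"]["resume_tokens"].find_one({"_id": "shop.orders"})

lag = (last_oplog["ts"].as_datetime()
       - token_doc["clusterTime"].as_datetime()).total_seconds()
if lag > 60:  # assumed alert threshold in seconds
    print(f"ALERT: change stream lag for shop.orders is {lag:.0f}s")
```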

Migrate the seed node

We gracefully stopped the hidden secondary and used cloud VM migration services to clone the entire VM (with data) to Cloud B. This transferred terabytes at disk speeds, bypassing network bottlenecks.

Build target replica set

We brought up the migrated VM in Cloud B and used it to bootstrap a new replica set, giving us 99% of the data immediately.
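
Bootstrapping the new set from that node looks roughly like the sketch below (hypothetical hostnames and set name; it assumes the migrated node was restarted standalone and its old replica set metadata in the local database was cleared first):

```python
# A minimal sketch of initiating the Cloud B replica set from the migrated seed node.
from pymongo import MongoClient

# Connect directly to the migrated node, bypassing replica set discovery.
seed = MongoClient("mongodb://seed.cloud-b.example:27017", directConnection=True)

seed.admin.command("replSetInitiate", {
    "_id": "rs-cloud-b",  # hypothetical new replica set name
    "members": [{"_id": 0, "host": "seed.cloud-b.example:27017"}],
})
```

Additional Cloud B members can then be added as usual; their initial syncs stay within the same cloud, so they avoid the cross-cloud bottleneck entirely.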

Synchronize ongoing changes

Now came the critical synchronization phase:

  • Planned maintenance window: We scheduled a brief downtime window for the delta backfill
  • Stopped write traffic: Applications were put into read-only or maintenance mode
  • Stopped delta store streaming: Halted the CDC pipeline capturing changes
  • Backfilled delta changes: Used mongoexport on the delta store and mongoimport into the target cluster in Cloud B to replay all changes that occurred during VM migration
  • Started fresh change stream: Pointed a new CDC pipeline from source (Cloud A) to target (Cloud B)
  • Resumed traffic: Applications were brought back online, now writing to Cloud A while CDC kept Cloud B synchronized in real-time

This delta backfill phase required only a brief downtime window (typically minutes to a few hours, depending on the volume of changes), and it ensured no writes were lost during the VM migration.
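
The backfill itself can be scripted; the sketch below shows the sequential shape with hypothetical URIs and collection names (in practice we parallelized the exports and imports, and replaying captured deletes is not shown):

```python
# A minimal sketch of the backfill step, shelling out to mongoexport/mongoimport
# per collection. URIs, database, and collection names are hypothetical.
import subprocess

DELTA_URI = "mongodb://delta-store.internal:27017"      # hypothetical
TARGET_URI = "mongodb://target.cloud-b.example:27017"   # hypothetical
COLLECTIONS = ["orders", "users"]                       # hypothetical list

for coll in COLLECTIONS:
    dump_file = f"/tmp/{coll}.json"
    # Export the latest state of every document changed during the VM migration.
    subprocess.run(["mongoexport", "--uri", DELTA_URI,
                    "--db", "delta", "--collection", coll,
                    "--out", dump_file], check=True)
    # Upsert those documents into the Cloud B cluster.
    subprocess.run(["mongoimport", "--uri", TARGET_URI,
                    "--db", "shop", "--collection", coll,
                    "--mode", "upsert", "--file", dump_file], check=True)
```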

Alternative: Zero-downtime approach

If a maintenance window is absolutely not an option, you could keep both the delta store CDC and the direct Cloud A → Cloud B change stream running simultaneously, then reconcile any overlapping changes. However, this comes with significant risks:

  • Duplicate operations: The same change might be applied twice
  • Race conditions: Concurrent updates to the same document from different streams
  • Complex conflict resolution: Requires sophisticated logic to handle out-of-order or conflicting operations
  • Data consistency challenges: Harder to guarantee correctness

We chose the maintenance window approach for its simplicity and strong consistency guarantees.

Cutover

We gradually shifted application traffic to Cloud B, enabled reverse streaming as a safety net, validated thoroughly, and decommissioned the database in Cloud A.

The Change Stream Script

We built a Python CDC pipeline, deployed as one Kubernetes Deployment per collection:

Core logic (sketched after the list):

  • Check if target DB/collection/indexes exist → create if missing
  • On insert → insert into target
  • On update → upsert into target (handles out-of-order events)
  • On replace → replace in target
  • On delete → delete from target
  • Store resume tokens in MongoDB for crash recovery
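
A condensed sketch of that loop, using hypothetical environment variable names in place of the ConfigMap values and collapsing insert/update/replace into idempotent upserts of the post-image (index and collection creation from the first bullet are omitted):

```python
# A minimal sketch of the per-collection CDC apply loop.
import os
from pymongo import MongoClient

SOURCE_URI = os.environ["SOURCE_URI"]          # hypothetical ConfigMap key
TARGET_URI = os.environ["TARGET_URI"]          # hypothetical ConfigMap key
DB, COLL = os.environ["NAMESPACE"].split(".")  # e.g. "shop.orders" (hypothetical)

source = MongoClient(SOURCE_URI)[DB][COLL]
target_client = MongoClient(TARGET_URI)
target = target_client[DB][COLL]
tokens = target_client["cdc_meta"]["resume_tokens"]  # resume tokens kept in MongoDB

saved = tokens.find_one({"_id": f"{DB}.{COLL}"})
with source.watch(resume_after=saved["token"] if saved else None,
                  full_document="updateLookup") as stream:
    for change in stream:
        op = change["operationType"]
        key = change["documentKey"]
        if op == "delete":
            target.delete_one(key)
        elif op in ("insert", "update", "replace") and change.get("fullDocument"):
            # Upserting the post-image keeps the loop idempotent even if events
            # are replayed after a restart.
            target.replace_one(key, change["fullDocument"], upsert=True)
        # Persist the resume token after every applied event for crash recovery.
        tokens.replace_one({"_id": f"{DB}.{COLL}"},
                           {"_id": f"{DB}.{COLL}", "token": change["_id"]},
                           upsert=True)
```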

Deployment: Helm charts on K8s with ConfigMaps for MongoDB URIs and collection names.

Why this worked

 ✅ VM migration eliminated weeks of network transfer time
 ✅ Change streams captured every write during bulk migration
 ✅ Delta store ensured zero data loss
 ✅ Minimal downtime: only a brief window for the delta backfill
 ✅ Safe rollback: reverse streaming kept the old cluster available

Key challenges

  • Oplog sizing: ensuring the oplog wouldn't roll over during VM migration (see the sketch after this list)
  • Delta backfill optimization: parallel exports/imports, optimal batch sizes
  • Network bandwidth: CDC pipelines consumed significant bandwidth
  • Resume token persistence: storing tokens in MongoDB for reliability
  • Manual index replication: change streams don't capture DDL
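
For the oplog sizing check, the current oplog window is easy to measure up front; a minimal sketch with a hypothetical URI, equivalent to rs.printReplicationInfo() in the shell:

```python
# A minimal sketch of measuring the source oplog window: the gap between the
# oldest and newest oplog entries tells you how long a stopped member can be
# offline before it falls off the oplog.
from pymongo import MongoClient

oplog = MongoClient("mongodb://source.cloud-a.example:27017")["local"]["oplog.rs"]

first_ts = oplog.find_one(sort=[("$natural", 1)])["ts"].as_datetime()
last_ts = oplog.find_one(sort=[("$natural", -1)])["ts"].as_datetime()

window_hours = (last_ts - first_ts).total_seconds() / 3600
print(f"Current oplog window: {window_hours:.1f} hours")
```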

Lessons learned

  • Hidden secondaries are perfect for seeding large migrations
  • VM migration services save weeks of transfer time
  • Change streams are production-ready for CDC workloads
  • Always have a rollback plan: Reverse streaming gave us confidence
  • Test in staging first: Uncovered critical edge cases

The outcome

  • Migrated terabytes of MongoDB data
  • Minimal downtime: only brief maintenance window
  • Zero data loss
  • Improved performance and latency in Cloud B
  • Reduced infrastructure costs

When to Use This

This approach makes sense when:

  • You have terabyte-scale datasets
  • You can afford a brief maintenance window (minutes to hours)
  • You have access to VM migration services
  • You have Kubernetes for CDC deployments

For datasets under 100GB, standard replica set extension is simpler.

At terabyte scale, textbook approaches don't work. Sometimes you need to write your own playbook.

 
