-->

New EMC Product Enables VM Mobility

Yesterday I read an interesting blog post about a new product from EMC. Its headline benefit is very fast long-distance VM (virtual machine) mobility, but at its core it’s a storage replication engine built around “Distributed Cache Coherency”.

The VM mobility is impressive. To quote the blog post:

At EMC World today we vMotioned 100 VMs between clusters in 4 minutes and 500 in 20 minutes. That’s 2.4 seconds per VM. Across the equivalent of 100KM.

Clearly this wouldn’t be possible if each VM had to go through a typical “vMotion”, which involves (in simplified terms) pausing the VM, writing active memory to a file, moving the disk image and memory file to a remote server, and resuming the VM. The pausing and resuming aren’t time-consuming, but moving that much data across a network can cause significant delay. That’s where EMC does a little “cheating” (or magic, depending on your point of view).

I don’t have first-hand knowledge of the platform. But I would guess that they’re keeping data mostly in sync all the time, using background replication, so that much less needs to be copied over the network at the point of migration. Even more importantly, they don’t have to copy all of the data across the network before starting to use it. The platform copies a much smaller directory structure that points to the local drives for local data or to the remote drives for data that hasn’t been copied yet. When remote data is accessed, it gets copied into local storage, which costs some performance. But the hit only affects that block at that moment; any already-replicated block is read at native speed.
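To make that guess concrete, here is a minimal Python sketch of the copy-on-first-access idea. It is a toy model of my own reading, not EMC’s implementation: the class LazyVolume, its methods, and the remote_reads counter are all hypothetical names, and plain dictionaries stand in for the storage at each site.

```python
class LazyVolume:
    """Toy model of a volume whose blocks migrate on first access.

    Hypothetical illustration only; not based on any EMC API.
    """

    def __init__(self, remote_blocks):
        # remote_blocks: dict of block number -> bytes held at the source site
        self.remote = remote_blocks
        self.local = {}
        # The small "directory": where does each block currently live?
        self.directory = {n: "remote" for n in remote_blocks}
        # Counts reads that had to cross the long-distance link
        self.remote_reads = 0

    def _pull(self, n):
        # Copy one block into local storage and update the directory
        if self.directory[n] == "remote":
            self.local[n] = self.remote[n]
            self.directory[n] = "local"

    def prefetch(self, block_numbers):
        # Stand-in for background replication done ahead of the migration
        for n in block_numbers:
            self._pull(n)

    def read(self, n):
        if self.directory[n] == "remote":
            # Slow path: only this block, only this once
            self.remote_reads += 1
            self._pull(n)
        # Fast path: the block is local by now
        return self.local[n]


volume = LazyVolume({0: b"boot", 1: b"data", 2: b"logs"})
volume.prefetch([0])   # block 0 was replicated in the background
volume.read(0)         # served locally
volume.read(1)         # first access crosses the link
volume.read(1)         # second access is local
print(volume.remote_reads)  # prints 1
```

The point of the sketch is that only the directory has to exist at the destination before the VM can start running there; block 2 is never copied at all unless something reads it.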

Here is a diagram borrowed from the EMC posting to illustrate the idea:

[Image: vplex-DCC-GIF]

As I said above, this is just my guess. The article goes into enough detail to get a rough picture of what happens. But I’m looking forward to hearing more details about this platform. I’m especially interested in hearing about conflict resolution. The article touches on this by saying:

The key is that VPLEX doesn’t solve the “what if two updates to the same block occur at the same moment?” problem (this is the core of the “did you solve the speed of light problem” question). The key is that MANY use cases ensure a single “writer” (single host writing to block at any given time), we don’t have to solve the “speed of light” problem (working on it :-) In the VMware use case, for example a single ESX host is the “writer” for a VM at any given moment, and a vMotion is an atomic operation – at one moment, one host is writing to a set of blocks, and then at next moment, another host is writing.

But the devil’s advocate in me still has to ask: but what if?! I’m sure I’ll have the chance to ask EMC someday soon. Regardless of their answer, I’m happy to see a platform like this one. EMC is uniquely positioned (given its relationship with, and ownership of, VMware) to make the assumptions and develop the tools that we’re going to need for any reasonable future in the cloud.
