I think live migration of the NameNode is something you would need to tread carefully with. It has not really been tested, and it is a key risk point.
We have done a lot of work and testing on using Linux HA to handle NN failover. This uses a floating IP address, and remounts the NFS drive on the new node as it moves the process.
* you can bring up a Linux HA cluster as VMs; it’s one way we did a lot of our failure testing.
* you must have a “STONITH” mechanism so that one of the VMs can kill the other if they ever lose contact. Normally that’s a power supply addressable over the network or a serial port. In the VM world, you can use a shell script that runs a command over SSH on the host server to power off the other VM.
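
As a rough illustration of the VM-world STONITH idea above, here is a minimal sketch in Python rather than shell. It assumes the host server runs libvirt (so `virsh destroy` is the power-off command) and that key-based SSH to the host is already set up. The host name, VM name, and function names are hypothetical; swap in whatever your hypervisor actually uses.

```python
import shlex
import subprocess
from typing import List


def build_fence_command(host: str, vm_name: str, user: str = "root") -> List[str]:
    """Return the ssh argv that asks the host server to hard-stop a VM.

    Assumption: the host runs libvirt, where `virsh destroy` is an
    immediate, ungraceful power-off of the named domain.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",       # never block on a password prompt
        "-o", "ConnectTimeout=5",    # give up quickly if the host is unreachable
        f"{user}@{host}",
        "virsh", "destroy", shlex.quote(vm_name),
    ]


def fence(host: str, vm_name: str, user: str = "root", timeout: float = 30.0) -> bool:
    """Try to power off the peer VM; True only if the command exited with 0."""
    try:
        result = subprocess.run(
            build_fence_command(host, vm_name, user),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # ssh missing, host unreachable, or the call hung: treat as not fenced
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(" ".join(build_fence_command("vmhost1.example.org", "namenode-b")))
```

The important design point is that a failed or timed-out fence has to count as "not fenced": the surviving node should not take over the NameNode role unless it knows the other VM is really off.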
Even in this world you have to make sure that switches and routing keep up with the floating IP address, which is the same problem you have with live VM migration. Keep all the failover nodes on the same switch, so that ARP resolution for off-switch systems stays the same, and hope that the ARP refresh message gets round to all the hosts on the local switch.
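
To make the "ARP refresh" part concrete, here is a small sketch of how the node that has just taken over the floating IP could announce it with unsolicited (gratuitous) ARP. It assumes the iputils version of `arping` is installed on the node (`-U` is its unsolicited ARP mode) and that the script runs with enough privileges to send raw ARP packets. The interface name, the address, and the function names are made up for the example.

```python
import ipaddress
import subprocess
from typing import List


def build_garp_command(interface: str, floating_ip: str, count: int = 3) -> List[str]:
    """Return the argv for announcing floating_ip via unsolicited ARP.

    Assumption: iputils `arping`, where -U selects unsolicited ARP mode,
    -I picks the interface and -c sets the number of packets to send.
    """
    ipaddress.IPv4Address(floating_ip)  # raises ValueError on a malformed address
    if count < 1:
        raise ValueError("count must be at least 1")
    return ["arping", "-U", "-I", interface, "-c", str(count), floating_ip]


def announce(interface: str, floating_ip: str, count: int = 3) -> bool:
    """Send the announcements; True if arping ran and exited with status 0."""
    try:
        result = subprocess.run(
            build_garp_command(interface, floating_ip, count),
            capture_output=True,
            text=True,
            timeout=count + 10,
        )
    except (subprocess.TimeoutExpired, OSError):
        # arping missing, not permitted, or hung
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical interface and address, for illustration only.
    print(" ".join(build_garp_command("eth0", "10.0.0.50")))
```

Sending the announcement does not guarantee that every host or switch updates its cache, so it is still worth checking from a client machine that the floating IP is reachable after a failover.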