Today’s guest blogger is from Hortonworks Technology Partner, WANdisco. Peter Scott, SVP of Business Development and OEM Sales at WANdisco, talks about how to easily migrate from one Hadoop distribution to Hortonworks Data Platform (HDP).
Migration between Hadoop versions and distributions can be difficult, often causing extended downtime and disruption, unless you use the right tools. DistCp (distributed copy) is a tool from Apache™ Hadoop® used for large inter- and intra-cluster copying. The problem with unidirectional solutions built on DistCp is that you have to take the old cluster offline while migrating to the new one. Customers find it extremely frustrating that access to data is restricted and normal operations can’t be maintained during the migration process.
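For context, a one-way DistCp copy is typically launched from the command line. The sketch below only assembles that command as a Python list and does not run it. The helper name and the cluster addresses are hypothetical placeholders, not values from this post.

```python
from typing import List


def build_distcp_command(source_uri: str, target_uri: str, update: bool = True) -> List[str]:
    """Build the argument list for a one-way DistCp copy.

    Hypothetical helper for illustration; the URIs are supplied by the caller.
    """
    command = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ on the target.
        command.append("-update")
    command.extend([source_uri, target_uri])
    return command


# Placeholder cluster addresses, for illustration only.
cmd = build_distcp_command(
    "hdfs://old-cluster:8020/data",
    "hdfs://new-cluster:8020/data",
)
print(" ".join(cmd))
```

The copy runs in one direction only, from source to target. Changes written to the source while a job is running are generally not captured until a later run, which is why migrations built on this approach tend to require a write freeze on the old cluster.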
Ideally, normal operations should continue during migration. Client applications and jobs should be able to run on the old cluster while the new one is brought online in parallel. Changes made on one cluster should be immediately available in the other. This would eliminate downtime in production environments and make a phased migration of users and applications possible without business disruption. This kind of migration experience can only be achieved with a true active-active replication solution that moves data with guaranteed consistency, without downtime or data loss.
With a well thought-out plan and WANdisco Fusion, migration from one distribution to another is possible without any disruption to service. This holds whether the clusters are running on HDFS or HCFS-compatible storage, and whether the old and new clusters are in the same data center or thousands of miles apart.
A successful Hadoop cluster migration should include a minimum of three phases: planning, strategy definition and implementation.
Planning: This requires a full understanding of the impact migration will have on your organization from development, operations and business perspectives. Research is the first step in planning: it gives you a complete handle on the likely impacts of migration and provides the input necessary to define an adequate test plan. Consult a broad range of sources to determine what is most important, pre- and post-migration.
Strategy Definition: The outcomes of the planning phase will guide the development of your migration strategy. Your strategy should include:
Implementation: This is typically performed in a series of logical steps:
Step 1: Establish the new cluster
The first step is to establish the new cluster’s environment and validate its correct implementation before migrating data to it. This new cluster can be used for disaster recovery (DR) during migration, and the old cluster can be used for DR post-migration. However, with WANdisco Fusion’s patented replication technology, both clusters are fully active, read-write everywhere, and recover automatically from each other after an outage.
Step 2: Migrate and test
WANdisco Fusion allows data transfer to take place while operations in both the old and new clusters continue as normal. You can test applications and compare results in both the old and new environments in parallel and validate that data has moved correctly. WANdisco Fusion can also move data selectively, so data no longer needed post-migration is not moved. If network or server outages occur during migration, WANdisco Fusion has built-in recovery features that resync the clusters automatically, without administrator intervention.
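One generic way to check that data has moved correctly is to compare file manifests from the two clusters. The following is a minimal sketch under stated assumptions: each manifest is a plain dictionary mapping a file path to a (size, checksum) pair that you have already collected by some other means. The function name and manifest format are hypothetical and are not part of WANdisco Fusion.

```python
from typing import Dict, List, Tuple

# Assumed manifest format: path -> (size in bytes, checksum string).
Manifest = Dict[str, Tuple[int, str]]


def compare_manifests(old: Manifest, new: Manifest) -> Dict[str, List[str]]:
    """Report paths that are missing, different, or unexpected on the new cluster."""
    # Present on the old cluster but absent from the new one.
    missing = sorted(p for p in old if p not in new)
    # Present on both, but size or checksum differs.
    mismatched = sorted(p for p in old if p in new and old[p] != new[p])
    # Present only on the new cluster.
    extra = sorted(p for p in new if p not in old)
    return {"missing": missing, "mismatched": mismatched, "extra": extra}


# Made-up manifests, for illustration only.
old_manifest = {
    "/data/a.csv": (100, "aaa"),
    "/data/b.csv": (200, "bbb"),
    "/data/c.csv": (300, "ccc"),
}
new_manifest = {
    "/data/a.csv": (100, "aaa"),
    "/data/b.csv": (200, "xyz"),
}
report = compare_manifests(old_manifest, new_manifest)
print(report)
```

If some data is deliberately excluded from the migration, those paths will show up under "missing", so you may want to filter them out of the old manifest before comparing.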
Step 3: Adopt new cluster
Adoption must incorporate feedback on the actual results, set against the expectations and outcomes defined in your migration strategy. Share your experience as broadly as possible to allow other groups or organizations to benefit from it.
If you have hardware and other infrastructure from the old cluster available post-migration, you can implement WANdisco Fusion in your new environment and take advantage of the features that were huge benefits during migration. These include patented active-active replication, which enables read-write access to the same data on every cluster regardless of the distance between them so that you can ingest and analyze data anywhere; automated disaster recovery; and 100% use of your hardware, without wasting 50% of your hardware budget on idle backup servers.
The risks and rewards of a Hadoop migration make it important to select and take full advantage of the best technologies and processes. WANdisco Fusion provides an unsurpassed level of control, automation and risk minimization for cluster migration, allowing you to mix and match Hadoop distributions, physical locations and cluster capacities with ease.
Please reach out to me if you would like more details on the above framework and steps involved in migration to HDP.