This blog was co-authored by Ryan Peterson, Head of Global Data Segment at AWS.
Central to delivering the right data in the right environment for the right use case is the ability to replicate data securely and independently of location, encapsulating and copying it seamlessly across private on-premises storage and public cloud environments.
Hortonworks Data Lifecycle Manager (DLM), an extensible service built on the Hortonworks DataPlane Platform (DPS), provides a complete solution for replicating HDFS and Hive data, metadata, and security policies between on-premises clusters and Amazon S3. This data movement enables data science and ML workloads to run models in Amazon SageMaker and bring the resulting data back on-premises. Replicating from HDFS to the AWS cloud takes three steps:
Using DPS, add an Ambari-managed HDP cluster. DPS provides information such as the cluster's location, number of nodes, and uptime (Fig 1). Identify and designate one of the clusters as the source cluster for replication to AWS S3.
1.1 Make sure that the logged-in user has the DataPlane Admin role (Fig 2).
1.2 From the top left corner, open the Data Lifecycle Manager app and click the Clusters page to see the list of clusters available for replication (Fig 3).
1.3 You are then directed to the DLM Cluster Dashboard page, which shows each cluster's location, status, usage, HDP and DLM versions, and nodes. You can also jump to the Ambari page from the cluster menu (Fig 4).
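DPS gathers these cluster details from the cluster's Ambari server. If you want to cross-check what Ambari reports, you can query its REST API directly; the host, port, and admin credentials below are placeholders for illustration, not values taken from this walkthrough.

```python
import json
from urllib import request

def registered_clusters(payload: str) -> list:
    """Extract cluster names from an Ambari /api/v1/clusters response body."""
    body = json.loads(payload)
    return [item["Clusters"]["cluster_name"] for item in body.get("items", [])]

# Hypothetical endpoint and credentials -- substitute your own Ambari server.
AMBARI_URL = "http://ambari-host.example.com:8080/api/v1/clusters"

def fetch_clusters(url: str = AMBARI_URL, user: str = "admin",
                   password: str = "admin") -> list:
    """Call Ambari's clusters endpoint with HTTP basic auth."""
    mgr = request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, url, user, password)
    opener = request.build_opener(request.HTTPBasicAuthHandler(mgr))
    with opener.open(url) as resp:
        return registered_clusters(resp.read().decode())
```

Any cluster name returned here should match a cluster visible on the DLM Clusters page.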
To replicate data to AWS S3, add cloud credentials using either key-based or role-based authentication. Click the Cloud Credentials tab, then click the Add button. Ensure the credentials are validated before saving them (Fig 5.1, 5.2 and 5.3).
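Whichever authentication method you choose, the credential needs enough S3 access on the target bucket to list, write, and clean up replicated objects. The sketch below builds a minimal IAM policy document for that purpose; the bucket name is a placeholder, and your security team may require a tighter or different policy.

```python
import json

# Placeholder bucket name -- replace with your replication target bucket.
BUCKET = "my-dlm-target-bucket"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # List the bucket so replication jobs can enumerate objects.
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {   # Read, write, and delete objects within the bucket.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

For role-based authentication, the same statements would go into the policy attached to the instance role rather than into a policy for an access-key user.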
3.1 In the DLM navigation pane, click Policies. (Fig 6)
3.2 The Replication Policies page displays a list of existing policies. Click “Add Policy”. (Fig 7)
3.3 Enter or select the following information:
3.4 On the Select Source page, set Type to “Cluster” and select as Source Cluster one of the clusters that you added in the previous section (Fig 9).
3.5 Use the file browser to select an existing folder path on the source cluster (e.g., /apps/traffic_data) (Fig 10).
3.6 In the Run Job section, click “From Now” and enter the replication frequency (e.g., 5 minutes, for demo purposes), then click Advanced Settings (Fig 12).
3.7 Queue, Maximum bandwidth, and Number of mappers are optional parameters; set them if required for the replication. Click “Create Policy” (Fig 13).
3.8 Once the policy is created, a “Policy created successfully” alert appears (Fig 14).
3.9 The policy is now created and the bootstrap is in progress. Bootstrap is the initial full copy of the dataset from the source cluster to the destination AWS S3 bucket. Subsequent policy instances run incremental replication, copying only the data that has changed since the previous run (Fig 15).
3.10 Once the bootstrap completes successfully, the second DLM policy instance initiates incremental replication (Fig 16).
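The bootstrap-then-incremental behavior in steps 3.9 and 3.10 can be sketched in a few lines: the first run copies the whole dataset, and each later run copies only files whose metadata has changed since the last replicated state. The file paths and timestamps below are illustrative, not DLM internals.

```python
def plan_replication(source: dict, replicated: dict) -> dict:
    """Decide which paths a replication run should copy.

    `source` and `replicated` map file path -> (size, mtime).
    An empty `replicated` state means this is the bootstrap run,
    so everything is copied; otherwise only new or changed files
    are selected (deletions are omitted for brevity).
    """
    if not replicated:
        return {"mode": "bootstrap", "copy": sorted(source)}
    changed = [path for path, meta in source.items()
               if replicated.get(path) != meta]
    return {"mode": "incremental", "copy": sorted(changed)}

src = {"/apps/traffic_data/day1.csv": (1024, 100),
       "/apps/traffic_data/day2.csv": (2048, 200)}

first = plan_replication(src, {})          # bootstrap: full copy
second = plan_replication(src, dict(src))  # nothing changed: nothing to copy
src["/apps/traffic_data/day2.csv"] = (4096, 300)
third = plan_replication(src, {"/apps/traffic_data/day1.csv": (1024, 100),
                               "/apps/traffic_data/day2.csv": (2048, 200)})
```

After the bootstrap, each scheduled instance (every 5 minutes in the demo policy above) only has to move the `changed` set, which is why incremental runs are typically much faster than the first one.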
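The Maximum bandwidth and Number of mappers settings from step 3.7 are the main throughput knobs. Assuming, as with Hadoop DistCp, that the bandwidth cap applies per mapper, a rough back-of-the-envelope estimate of how long a copy will take is (purely illustrative, ignoring S3 and network overhead):

```python
def replication_hours(dataset_gb: float, mappers: int,
                      mb_per_sec_per_mapper: float) -> float:
    """Estimate wall-clock hours to copy `dataset_gb` gigabytes,
    assuming the bandwidth cap applies per mapper and all mappers
    run in parallel at the cap."""
    aggregate_mb_per_sec = mappers * mb_per_sec_per_mapper
    seconds = dataset_gb * 1024 / aggregate_mb_per_sec
    return seconds / 3600

# e.g., 500 GB with 10 mappers capped at 10 MB/s each -> 100 MB/s aggregate
est = replication_hours(500, 10, 10)
```

A sizing pass like this helps confirm that the bootstrap can finish before the first incremental instance is scheduled to run.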
Visit Hortonworks at booth #629 in the Sands Expo Center during AWS re:Invent 2018 to learn more about Hortonworks, DLM and the winning combination of Big Data and Cloud.