November 20, 2018

A Step-by-Step Replication Guide between On-Prem HDFS and Amazon Web Services

This blog was co-authored by Ryan Peterson, Head of Global Data Segment at AWS.

Central to empowering businesses to deliver the right data in the right environment for the right use case is location-agnostic, secure replication: the ability to encapsulate and copy data seamlessly between private on-premises storage and public cloud environments.

Hortonworks’ Data Lifecycle Manager (DLM), an extensible service built on the Hortonworks DataPlane Service (DPS), provides a complete solution for replicating HDFS and Hive data, metadata, and security policies between on-premises clusters and Amazon S3. This data movement lets data science and ML workloads execute models in Amazon SageMaker and bring the resulting data back on-premises. The steps below walk through setting up replication from on-premises HDFS to the AWS cloud:

Step 1: Add Source Cluster for Replication

Using DPS, add an Ambari-managed HDP cluster. DPS provides information such as the cluster's location, number of nodes, and uptime (Fig 1). Identify and designate one of the clusters as the source cluster for the AWS S3 cloud replication.

Fig 1: DPS Cluster UI with the list of clusters

1.1 Make sure that the logged-in user belongs to the Dataplane Admin role (Fig 2).

1.2 From the top left corner, go to the Data Lifecycle Manager app and open the Clusters page to see the list of clusters that can be used for replication. (Fig 3)

Fig 3: Navigate DLM from DPS

1.3 You are then directed to the DLM Cluster Dashboard page, which shows each cluster's location, status, usage, HDP and DLM versions, and nodes. There is also an option to open the Ambari page from the cluster menu. (Fig 4)

Fig 4: DLM Cluster showing list of clusters for replication
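
Before registering a cluster in DPS, it can help to confirm that Ambari is reachable and reports the cluster. The snippet below is a minimal sketch using the Ambari REST API; the host, port, and credentials are placeholders, not values from this walkthrough.

```python
# Minimal sketch: confirm an Ambari-managed HDP cluster is reachable before
# registering it in DPS. Host, port, and credentials below are placeholders.
import requests

AMBARI_URL = "http://ambari-host.example.com:8080"   # assumed Ambari endpoint
AUTH = ("admin", "admin")                            # assumed credentials

resp = requests.get(f"{AMBARI_URL}/api/v1/clusters", auth=AUTH, timeout=30)
resp.raise_for_status()

for item in resp.json().get("items", []):
    print("Ambari reports cluster:", item["Clusters"]["cluster_name"])
```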

Step 2:  Add Cloud Credentials

To replicate data to AWS S3, add cloud credentials using either key-based or role-based authentication. Click the Cloud Credentials tab, then click the Add button to add the cloud credentials. Ensure the credentials are validated before saving them. (Fig 5.1, 5.2, and 5.3)

Fig 5.1: Validate Cloud Credentials
Fig 5.2: Save Cloud Credentials
Fig 5.3: List of Cloud Credentials
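
DLM validates the credentials in the UI, but it can be useful to sanity-check the same access key or role outside DLM first. Below is a minimal, hypothetical sketch using boto3; the profile name and bucket name are illustrative placeholders.

```python
# Minimal sketch: sanity-check AWS credentials before adding them in DLM.
# The profile name and bucket name are placeholders for illustration.
import boto3

session = boto3.Session(profile_name="dlm-replication")  # assumed AWS profile

# Confirm the credentials resolve to a valid identity (key- or role-based).
identity = session.client("sts").get_caller_identity()
print("Authenticated as:", identity["Arn"])

# Confirm the credentials can reach the target bucket (raises if not accessible).
session.client("s3").head_bucket(Bucket="my-dlm-target-bucket")
print("Bucket is accessible")
```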

Step 3: Create DLM Policy

 3.1 In the DLM navigation pane, click Policies. (Fig 6)

Fig 6: DLM Policies UI

3.2 The Replication Policies page displays a list of existing policies. Click “Add Policy”. (Fig 7)

Fig 7: “ADD POLICY” on the top right corner

3.3 Enter or select the following information:

  • Policy Name (Required) –> Enter the policy name of your choice
  • Service –> Select HDFS (Fig 8)

Fig 8: Create Replication Wizard – Select Service

3.4 On the Select Source page –> Select Type as “Cluster” –> Select Source Cluster as one of the clusters that you added in the previous section (the source cluster). (Fig 9)

Fig 9: Create Replication Wizard – Select Source Cluster

3.5 Using the file browser, navigate to and select an existing folder path on the source cluster (e.g., /apps/traffic_data). (Fig 10)

Fig 10: Create Replication Wizard – Select Source Folder
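
If you are unsure whether the source path exists, you can check it from an edge node before walking through the wizard. This is a minimal sketch that shells out to the standard HDFS client; it assumes the client is installed and configured on the node where you run it.

```python
# Minimal sketch: verify the source folder exists on the source cluster
# before selecting it in the wizard. The path matches the example above.
import subprocess

SOURCE_PATH = "/apps/traffic_data"

# 'hdfs dfs -test -d <path>' exits with 0 when the directory exists.
result = subprocess.run(["hdfs", "dfs", "-test", "-d", SOURCE_PATH])
if result.returncode == 0:
    print(f"Source folder {SOURCE_PATH} exists")
else:
    print(f"Source folder {SOURCE_PATH} was not found")
```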
  • On the Select Destination page –> Select Type as S3 –> Select a cloud credential (Fig 11)
  • Select the path to the S3 bucket. If the bucket exists, DLM replicates the content to it, provided the cloud credentials have write access to the bucket. If the bucket does not exist, DLM creates the bucket and then replicates the content from the source cluster (see the sketch after Fig 11).
  • Select the Encryption Type. Two encryption protocols are supported – SSE-S3 and SSE-KMS. DLM overrides the S3 bucket encryption with the encryption selected in this step.
  • Click “Validate”, then click “Schedule”. “Validate” ensures that the user has the file permissions required to copy to the destination. (Fig 11)
Fig 11: Create Replication Wizard – Select Cloud as Target
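
The bucket-existence and encryption behavior above can be illustrated outside DLM with the AWS SDK. The sketch below is a hypothetical approximation of those checks using boto3, with placeholder bucket and region names; DLM performs the equivalent work internally.

```python
# Minimal sketch of the checks DLM performs on the target bucket: does it
# exist, and which server-side encryption applies. Bucket name and region
# are placeholders; this illustrates the idea, not DLM's internal code.
import boto3
from botocore.exceptions import ClientError

BUCKET = "my-dlm-target-bucket"   # assumed target bucket
REGION = "us-west-2"              # assumed region

s3 = boto3.client("s3", region_name=REGION)

try:
    s3.head_bucket(Bucket=BUCKET)
    print("Bucket exists and is accessible")
except ClientError as err:
    if err.response["Error"]["Code"] == "404":
        # DLM creates the bucket when it does not exist; shown here for illustration.
        s3.create_bucket(
            Bucket=BUCKET,
            CreateBucketConfiguration={"LocationConstraint": REGION},
        )
        print("Bucket created")
    else:
        raise

# Apply SSE-S3 (AES256) as the default bucket encryption; for SSE-KMS,
# use SSEAlgorithm 'aws:kms' together with a KMSMasterKeyID.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)
```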

3.6 In the Run Job section, click “From Now” and enter the replication frequency (e.g., Freq – 5 and select Minute(s), for demo purposes), then click Advanced Settings. (Fig 12)

Fig 12: Create Replication Wizard – Schedule

3.7 Queue, Maximum Bandwidth, and Number of Mappers are optional parameters; set them if required for the replication. Click “Create Policy”. (Fig 13)

Fig 13: Create Replication Wizard – Advanced Settings

3.8 Once the policy is created successfully, the alert “Policy created successfully” is displayed. (Fig 14)

Fig 14: Policy Created successfully

3.9 The policy is now created and the bootstrap is in progress. Bootstrap is the first, full copy of the dataset from the source to the destination AWS S3 bucket. Subsequent policy instances run incremental replication, which copies only the changed or updated data from the source to the destination dataset. (Fig 15)

Fig 15: DLM Replication – Bootstrap is in progress
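
To make the bootstrap-versus-incremental distinction concrete, a rough analogy is the difference between a plain DistCp run and one with -update. The sketch below is only a conceptual approximation with placeholder paths; DLM manages the actual replication jobs itself, and S3 credentials are assumed to be configured for the s3a connector.

```python
# Conceptual sketch of bootstrap vs. incremental copies (not DLM's own code).
import subprocess

SRC = "hdfs:///apps/traffic_data"
DST = "s3a://my-dlm-target-bucket/apps/traffic_data"

# Bootstrap: the first run copies the full dataset.
subprocess.run(["hadoop", "distcp", SRC, DST], check=True)

# Incremental: later runs with -update copy only files that have changed.
subprocess.run(["hadoop", "distcp", "-update", SRC, DST], check=True)
```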

3.10 Once the bootstrap completes successfully, the second DLM policy instance initiates incremental replication. (Fig 16)

Fig 16: DLM Replication policy with Incremental instance

Step 4. Review the replicated data in the AWS S3 bucket. You can see the data replicated from the source cluster to the AWS S3 environment. (Fig 17)

Fig 17: Dataset replicated in target S3 Bucket
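
Besides the S3 console, the replicated objects can be listed with the AWS SDK. A minimal sketch with placeholder bucket and prefix names:

```python
# Minimal sketch: list the replicated objects under the target prefix to
# confirm the data landed in S3. Bucket and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-dlm-target-bucket", Prefix="apps/traffic_data/")

for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```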

Step 5. A critical final step is bringing successfully executed data science models back from the cloud to the on-premises cluster. Create a DLM policy in the reverse direction to bring selected data (models and results) from AWS S3 back to the on-premises cluster. (Fig 18)

Fig 18: Dataset replicated from S3 Bucket to on-premise Cluster

Next Steps

We recommend checking the Hortonworks DLM documentation site and presentation to learn the details of the DLM architecture and to see replication in action.

Visit Hortonworks at booth #629 in the Sands Expo Center during AWS re:Invent 2018 to learn more about Hortonworks, DLM and the winning combination of Big Data and Cloud.
