Importing snapshots from Amazon S3 to HBase

This topic contains 8 replies, has 3 voices, and was last updated by  Dale Bradman 3 months ago.

  • Creator
    Topic
  • #58424

    techops_korrelate
    Participant

    Hello,

    We created snapshots and exported them to S3 using the Snapshot Export tool. We are trying to figure out how to import them into HBase so that they are a) visible as snapshots and b) can be cloned into a viable table.

    To export (as the hbase user), with AWS credentials already configured in HDFS:
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot '$snapshotname' -copy-to s3n://$bucket_name/$snapshotname -mappers 4

    We’re having trouble importing them to another cluster. There is no “Snapshot Import” tool. We’ve been attempting to use hadoop distcp to copy from S3 to the target HDFS:
    hadoop distcp s3n://$bucket_name/$snapshotname /apps/hbase/data/.hbase-snapshot
    But the result has a different file structure from what we see when exporting a snapshot directly between clusters. The snapshots either a) don’t appear or b) are corrupted and cannot be cloned.

    Could someone indicate the correct import path to use when copying an exported snapshot back from an S3 (or other DFS) source?

    Relevant information:
    HDP: 2.1
    HBase: 0.98.0.2.1
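
    (A rough sketch of the distcp route, for reference. It assumes the export landed under s3n://$bucket_name/$snapshotname, so that path contains .hbase-snapshot/ and archive/ subfolders, and that the HBase root on the target cluster is the HDP default /apps/hbase/data; adjust the names for your environment, and run as the hbase user or chown afterwards so HBase can read the files.)

    # 1) Copy the snapshot metadata into the target cluster's .hbase-snapshot dir.
    #    -update copies the *contents* of the source into the target path, so an
    #    existing directory is merged into rather than nested under.
    hadoop distcp -update \
      s3n://$bucket_name/$snapshotname/.hbase-snapshot/$snapshotname \
      /apps/hbase/data/.hbase-snapshot/$snapshotname

    # 2) Copy the HFiles the snapshot references into the archive dir.
    hadoop distcp -update \
      s3n://$bucket_name/$snapshotname/archive \
      /apps/hbase/data/archive

    # 3) The snapshot should then show up in the hbase shell:
    #    list_snapshots
    #    clone_snapshot '$snapshotname', 'restored_table'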

  • Author
    Replies
  • #70896

    Dale Bradman
    Participant

    Further update:

    What is happening is that the folder structure is being written inside a virtual folder in the bucket. I am aware that S3 has no concept of folders, but that is how it appears in the file browser UI.

    Once the job has failed, the path of the folder is <BUCKET_NAME>//2HBASE-SNAP-X. Notice the double forward slash, which differs from the path the job is trying to write to, <BUCKET_NAME>/2HBASE-SNAP-X.

    Why is this virtual folder being created and how can I get it to write to the correct path?
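
    (One thing worth ruling out, as a hedge: an empty path segment like “//” in an s3n destination is often a sign that part of the URI was parsed as empty, and AWS secret keys containing a “/” are a known way for that to happen when the credentials are embedded in the URL. A minimal sketch of the export with the keys kept out of the URI; it assumes fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey are set in core-site.xml, as in the sketch further down the thread.)

    # Sketch only: no credentials in the URI, so no stray '/' can end up in the path.
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
      -snapshot "SNAP_X" \
      -copy-to s3n://<BUCKET_NAME>/2HBASE-SNAP-X \
      -mappers 3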

    #70731

    Dale Bradman
    Participant

    An update:

    In my S3 bucket I can see a folder structure for the “2HBASE-SNAP_X” snapshot; however, nothing is actually written to it.

    The process fails after 2015-04-27 08:59:27,305 INFO [main] mapreduce.Job: map 0% reduce 0%

    #70730

    Dale Bradman
    Participant

    Hello, I am having an issue with exporting to S3 and was wondering if you could give any advice.

    I get an error saying:
    2015-04-27 05:39:49,547 INFO [IPC Server handler 0 on 40333] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1429544880663_0004_m_000000_0: Error: java.io.IOException: Could not get the output FileSystem with root=s3n://<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>@<BUCKET_NAME>/2HBASE-SNAP_X
    at org.apache.hadoop.hbase.snapshot.ExportSnapshot$ExportMapper.setup(ExportSnapshot.java:149)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.hbase.snapshot.ExportSnapshot$ExportMapper.setup(ExportSnapshot.java:147)
    ... 8 more

    MapReduce isn’t my strongest area, and I have a feeling it could be to do with not specifying an output path for the mappers?

    The command I use to export the snapshot is:

    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot "SNAP_X" -copy-to s3n://<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>@<BUCKET_NAME>/2HBASE-SNAP_X -mappers 3
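
    (The key line is the final “Caused by”: the map task cannot resolve a FileSystem implementation for the s3n scheme. A hedged sketch of one common fix on Hadoop 2.x, assuming the stock NativeS3FileSystem class is available on the cluster: register the implementation and the credentials explicitly in core-site.xml on the nodes that run the map tasks, then confirm the scheme resolves before re-running the export.)

    # Properties to add to core-site.xml (shown here as comments; the class below
    # is the native S3 filesystem that ships with Hadoop 2.x):
    #   fs.s3n.impl               = org.apache.hadoop.fs.s3native.NativeS3FileSystem
    #   fs.s3n.awsAccessKeyId     = <ACCESS_KEY_ID>
    #   fs.s3n.awsSecretAccessKey = <SECRET_ACCESS_KEY>

    # Quick check that the s3n scheme now resolves outside of MapReduce:
    hadoop fs -ls s3n://<BUCKET_NAME>/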

    #59395

    John Cooper
    Participant

    Managed to get the s3n:// import to work using this tool, but not the s3 block import. I’m looking at forking the tool and producing a how-to guide.

    #59113

    John Cooper
    Participant

    Found the problem: the S3 role was missing. I thought adding authentication would be enough, but it only allows copying in, not copying out. So now the snapshot export is successful and the snapshot-s3-util export works. The import fails when using the s3 block store:

    sudo -u hbase HADOOP_CLASSPATH=YOURHADOOPPATH/lib/hbase/lib/* hadoop jar target/snapshot-s3-util-1.0.0.jar com.imgur.backup.SnapshotS3Util --import --snapshot test5-snapshot-20140822_090717 -d /hbase -k key -s secret --bucketName mybucket
    14/08/22 09:12:02 WARN security.UserGroupInformation: PriviledgedActionException as:hbase (auth:SIMPLE) cause:org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: s3

    and when trying s3n it doesn’t pick up the secret key.

    sudo -u hbase HADOOP_CLASSPATH=YOURHADOOPPATH/lib/hbase/lib/* hadoop jar target/snapshot-s3-util-2.0.0.jar com.imgur.backup.SnapshotS3Util --import --snapshot test1-snapshot-20140822_101514 -d /hbase -a true -k key -s secret --bucketName mybucket

    java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey

    Will see if I can fix this.
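
    (For what it’s worth, the two property names in that exception can also be passed to a plain distcp as generic options, which sidesteps the util’s own credential handling. A rough sketch with placeholder paths; the archive/ folder would be copied the same way, as in the distcp sketch in the opening post.)

    # Hedged fallback: pull the snapshot metadata back with distcp, supplying the
    # credentials as the exact properties named in the error. With no -update flag,
    # distcp nests the source directory under the existing target directory.
    sudo -u hbase hadoop distcp \
      -Dfs.s3n.awsAccessKeyId=<ACCESS_KEY_ID> \
      -Dfs.s3n.awsSecretAccessKey=<SECRET_ACCESS_KEY> \
      s3n://mybucket/hbase/.hbase-snapshot/test1-snapshot-20140822_101514 \
      /hbase/.hbase-snapshot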

    #59043

    John Cooper
    Participant

    It looks like the MapReduce (YARN) job is failing to move/rename the temporary files it creates on S3; I’m not sure if this is a general issue with moving/renaming files/directories in S3. It works fine when exporting to a normal file system using file:///, and I can then use "aws s3 cp sourcefolder s3://mybucket/sourcefolder --recursive" to copy the archive and .hbase-snapshot folders to S3.

    The import using distcp should work as long as the .hbase-snapshot and archive folders are copied into the HBase root in HDFS (/hbase). I’ve also managed to use snapshot export to another HBase cluster and then, from the other cluster, use snapshot export to copy the files back; restore_snapshot then worked fine.
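
    (A rough sketch of that workaround, with placeholder names; once the folders are in S3, the import is the same distcp copy into the HBase root as sketched in the opening post.)

    # 1) Export the snapshot to a plain filesystem instead of S3. Note that with
    #    file:/// each mapper writes to its own node's local disk, so this assumes
    #    a single-node cluster or a shared mount at the target path.
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
      -snapshot my_snapshot -copy-to file:///tmp/my_snapshot_export -mappers 1

    # 2) Push the resulting .hbase-snapshot/ and archive/ folders to S3.
    aws s3 cp /tmp/my_snapshot_export s3://mybucket/my_snapshot_export --recursive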

    #58999

    John Cooper
    Participant

    Hi, I’ve tried this command on the same version on both Hortonworks and Cloudera, but the export to S3 fails because the snapshot info is missing; the actual data in the archive directory is there. Is there anything special about how you set up S3? I am also working on getting the import working and am trying https://github.com/lospro7/snapshot-s3-util, which is a wrapper around the export snapshot command. I’ve compiled it against 0.98, but it is failing due to the missing snapshot info. Once I get that fixed I am sure the util will run OK.

    org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Couldn’t read snapshot info from:s3n://key:secret@mybucket/hbase/.hbase-snapshot/test3s1/.snapshotinfo

    Hortonworks doesn’t complain but the .snapshotinfo is missing just the same.
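
    (A small, hedged sanity check: list the snapshot directory on the source HDFS and the corresponding prefix in the bucket to see exactly which files made it across; names and paths below are placeholders taken from the error above.)

    # On the source cluster:
    hdfs dfs -ls /hbase/.hbase-snapshot/test3s1

    # In the bucket (hadoop fs accepts the credentials as generic -D options):
    hadoop fs \
      -Dfs.s3n.awsAccessKeyId=key \
      -Dfs.s3n.awsSecretAccessKey=secret \
      -ls s3n://mybucket/hbase/.hbase-snapshot/test3s1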

    I also need to get s3:// auth working, as s3n:// has a 5TB limit.

    Thanks, John.

    #58530

    techops_korrelate
    Participant

    Bump! Any thoughts?
