s3 for hdfs


This topic contains 2 replies, has 2 voices, and was last updated by Steve Loughran 2 months ago.

  • Creator
    Topic
  • #56697

    Brian Brady
    Participant

    Hello fair internet people

I have a fully functioning 5-node Ambari cluster set up in AWS.
I am now trying to follow https://wiki.apache.org/hadoop/AmazonS3 to replace my HDFS with S3.

    In my Ambari setup, I clicked on HDFS then the config tab.
    In the Advanced section I found the property for
    fs.defaultFS
    and changed it from hdfs://ip-xx-xx-xx-xx.compute.internal:8020
    to s3://bucket-name/
    Then I added
    fs.s3.awsAccessKeyId
    and
    fs.s3.awsSecretAccessKey
    with their values to the hdfs-site.xml section.

I presume this is essentially adding all these relevant values into the hdfs-site.xml config file on the server.
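
For reference, those properties would look roughly like this as site XML entries (the bucket name and keys are placeholders):

<property>
  <name>fs.defaultFS</name>
  <value>s3://bucket-name/</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>MY_ACCESS_KEY</value>  <!-- placeholder -->
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>MY_SECRET_KEY</value>  <!-- placeholder -->
</property>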

So, when I restart the nodes, the NameNode fails to come back up, with the following error:

    safemode: FileSystem s3://bucket-name is not an HDFS file system
    Usage: java DFSAdmin [-safemode enter | leave | get | wait]
2014-06-30 09:56:25,604 - Retrying after 10 seconds. Reason: Execution of 'su - hdfs -c 'hadoop dfsadmin -safemode get' | grep 'Safe mode is OFF'' returned 1. DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.

Do I have to configure something else somewhere to get HDFS to like using S3? Or to get Ambari specifically to use it?

    Thanks in advance
    Brian


  • Author
    Replies
  • #56780

    Steve Loughran
    Participant

    Brian,

I’m afraid you can’t swap out HDFS for Amazon S3: the latter is an eventually consistent object store with directory operations simulated in the Hadoop client. While the filesystem client pretends that it is an FS, it isn’t really one, and things will break. Nobody does this, and the fact that a bit of the Hadoop wiki says you can is an error that someone should have corrected a long time ago. Ambari is just picking up on this problem early, as it assumes that the FS is HDFS and tries to run some startup operations against it; but even without that, things would go wrong later.

I have just updated the Apache wiki page to explain why you can’t do this. If you want more details and really, really want to try getting it to work, look at what Netflix had to do with S3mper and consider whether it is worth it. It is for them, as they run many dynamic Hadoop clusters and direct the output of every Hadoop sequence to S3, then direct queries to whichever of those clusters has capacity. If you have one cluster, you don’t need to do this.

What you can do instead is run HDFS on your EC2-hosted cluster, and use S3 as the place to read source data from and write results to at the end of a workflow. Intermediate data should live in HDFS, as it is faster to access.
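
A rough sketch of that pattern, with made-up bucket and path names (and assuming the S3 keys are available in the cluster configuration): stage the input from S3 into HDFS, run the workflow against HDFS, then publish only the final output back to S3.

# stage the source data from S3 into the cluster's HDFS
hadoop distcp s3n://my-bucket/input /data/input
# run the workflow against HDFS, keeping intermediate data in HDFS
hadoop jar my-workflow.jar MyWorkflow /data/input /data/output
# push only the final results back out to S3
hadoop distcp /data/output s3n://my-bucket/results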

    #56779

    Brian Brady
    Participant

No one?
Has anyone gotten S3 to work with HDFS in general, ignoring the Ambari component, so I can start from there?
