s3 for hdfs


This topic contains 3 replies, has 3 voices, and was last updated by  Manne Laukkanen 4 months, 1 week ago.

  • Creator
    Topic
  • #56697

    Brian Brady
    Participant

    Hello fair internet people

    I have a fully functioning 5 node Ambari cluster setup in AWS.
    I am now trying to follow https://wiki.apache.org/hadoop/AmazonS3 to replace my hdfs with s3.

    In my Ambari setup, I clicked on HDFS then the config tab.
    In the Advanced section I found the property for
    fs.defaultFS
    and changed it from hdfs://ip-xx-xx-xx-xx.compute.internal:8020
    to s3://bucket-name/
    Then I added
    fs.s3.awsAccessKeyId
    and
    fs.s3.awsSecretAccessKey
    with their values to the hdfs-site.xml section.

    I presume this essentially adds all these relevant values to the hdfs-site.xml config file on the server.
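
    For reference, this is roughly what those changes look like once they are written out as XML properties (just a sketch of what I described above; the bucket name and key values are placeholders):

    <!-- sketch of the properties described above; values are placeholders -->
    <property>
      <name>fs.defaultFS</name>
      <value>s3://bucket-name/</value>
    </property>
    <property>
      <name>fs.s3.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3.awsSecretAccessKey</name>
      <value>YOUR_SECRET_ACCESS_KEY</value>
    </property>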

    So when I restart the nodes, the NameNode fails to come back up, with the following error:

    safemode: FileSystem s3://bucket-name is not an HDFS file system
    Usage: java DFSAdmin [-safemode enter | leave | get | wait]
    2014-06-30 09:56:25,604 - Retrying after 10 seconds. Reason: Execution of 'su - hdfs -c 'hadoop dfsadmin -safemode get' | grep 'Safe mode is OFF'' returned 1. DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.

    Do I have to configure something else somewhere to get hdfs to like using s3? Or to get Ambari specifically to use it?

    Thanks in advance
    Brian

Viewing 3 replies - 1 through 3 (of 3 total)


  • Author
    Replies
  • #69027

    Manne Laukkanen
    Participant

    Brian: Steve:
    http://aws.amazon.com/s3/faqs/
    “Q: What data consistency model does Amazon S3 employ? Amazon S3 buckets in the US Standard region provide eventual consistency. Amazon S3 buckets in all other regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.”
    Is this a mistaken claim?

    We have our buckets outside US Standard region.

    If we did a flow of source -> create object in S3 in a landing area -> read from that, create object in storage area -> read from that… are you saying this would result in data inconsistency?
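
    To make the question concrete, here is a sketch of the step we are worried about (the s3n:// bucket and paths are only placeholders):

    # PUT a new object into the landing area
    hadoop fs -put part-0001.csv s3n://bucket-name/landing/
    # the next stage immediately lists the landing "directory";
    # is that listing guaranteed to include the object just written?
    hadoop fs -ls s3n://bucket-name/landing/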

    #56780

    Steve Loughran
    Participant

    Brian,

    I’m afraid you can’t swap out HDFS for Amazon S3, as the latter is an eventually consistent object store with directory operations simulated in the Hadoop client. While the filesystem client pretends that it is an FS, it isn’t really one, and things will break. Nobody does this, and the fact that a bit of the Hadoop wiki says you can is an error that someone should have corrected a long time ago. Ambari is just picking up on this problem early: it assumes that the FS is HDFS and tries to run some startup operations against it, but even without that, things would go wrong later.

    I have just updated the Apache wiki page to explain why you can’t do this. If you want more details and really, really want to try getting it to work, look at what Netflix had to do with S3mper and consider whether it is worth it. It is for them, as they run many dynamic Hadoop clusters, direct the output of every Hadoop sequence to S3, and then direct queries to any one of the clusters that has capacity. If you have one cluster you don’t need to do this.

    What you can do instead is keep HDFS running on your EC2-hosted cluster, and use S3 as the place you read source data from and write final results to at the end of a workflow. Intermediate data should live in HDFS, as it is faster to access.
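
    In practice that pattern looks something like this sketch (the s3n:// bucket, the paths and my-job.jar are placeholders, not anything specific to your cluster):

    # copy source data from S3 into HDFS at the start of the workflow
    hadoop distcp s3n://bucket-name/input /user/brian/input
    # run the job entirely against HDFS, so intermediate data never touches S3
    hadoop jar my-job.jar com.example.MyJob /user/brian/input /user/brian/output
    # push only the final results back out to S3 at the end
    hadoop distcp /user/brian/output s3n://bucket-name/results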

    #56779

    Brian Brady
    Participant

    No one?
    Has anyone gotten S3 working in place of HDFS in general, ignoring the Ambari component, so that I can start from there?
