
The legacy Hortonworks Forum is now closed. A read-only version of the former site remains available; the site will be taken offline on January 31, 2016.

Ambari Forum

s3 for hdfs

  • #56697
    Brian Brady

    Hello fair internet people

    I have a fully functioning 5-node Ambari cluster set up in AWS.
    I am now trying to follow a guide to replace my HDFS with S3.

    In my Ambari setup, I clicked on HDFS and then the Configs tab.
    In the Advanced section I found the default filesystem property
    and changed it from hdfs://ip-xx-xx-xx-xx.compute.internal:8020
    to s3://bucket-name/
    Then I added the required S3 properties
    with their values to the hdfs-site.xml section.

    I presume this is essentially adding all these relevant values into the hdfs-site.xml config file on the server.
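    For reference, a change like the one described would look roughly like this in the Hadoop configuration. This is a sketch only: `fs.defaultFS` actually belongs in core-site.xml rather than hdfs-site.xml, the property names below are those of the s3a connector (the exact names in the post were elided), and the credentials are placeholders:

    ```xml
    <!-- core-site.xml (sketch; property names assume the s3a connector) -->
    <property>
      <name>fs.defaultFS</name>
      <value>s3a://bucket-name/</value>
    </property>
    <property>
      <name>fs.s3a.access.key</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>YOUR_SECRET_KEY</value>
    </property>
    ```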

    So, when I restart the nodes, the NameNode fails to come back up with the following error:

    safemode: FileSystem s3://bucket-name is not an HDFS file system
    Usage: java DFSAdmin [-safemode enter | leave | get | wait]
    2014-06-30 09:56:25,604 - Retrying after 10 seconds. Reason: Execution of 'su - hdfs -c 'hadoop dfsadmin -safemode get' | grep 'Safe mode is OFF'' returned 1. DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.

    Do I have to configure something else somewhere to get hdfs to like using s3? Or to get Ambari specifically to use it?

    Thanks in advance

  • Author
  • #56779
    Brian Brady

    No one?
    Has anyone gotten S3 to work with Hadoop in general, ignoring the Ambari component, so I can start from there?

    Steve Loughran


    I'm afraid you can't swap out HDFS for Amazon S3. The latter is an eventually consistent object store with directory operations simulated in the Hadoop client; while the filesystem client pretends that it is an FS, it isn't really, and things will break. Nobody does this, and the fact that a bit of the Hadoop wiki says you can is an error that someone should have corrected a long time ago. Ambari is just picking up on this problem early, as it assumes that the FS is HDFS and tries to run some startup operations against it; but even without that, things would go wrong later.

    I have just updated the Apache wiki page to explain why you can't do this. If you want more details and really, really want to try getting it to work, look at what Netflix had to do with S3mper and consider whether it is worth it. It is for them because they run many dynamic Hadoop clusters, direct the output of every Hadoop sequence to S3, and then send queries to whichever cluster has capacity; if you have one cluster you don't need to do this.

    What you can do instead is run HDFS on your EC2 hosts and use S3 as the place to read source data from and to write results to at the end of a workflow. Intermediate data should live in HDFS, as it is faster to access.
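    The pattern described above is usually driven with distcp. A sketch, assuming the s3a connector is configured and `bucket-name` and the paths are placeholders:

    ```shell
    # Stage source data from S3 into HDFS at the start of the workflow
    hadoop distcp s3a://bucket-name/input hdfs:///data/input

    # ...run jobs against hdfs:///data/... for all intermediate stages...

    # Publish final results back to S3 at the end
    hadoop distcp hdfs:///data/output s3a://bucket-name/output
    ```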

    Manne Laukkanen

    Brian, Steve:
    "Q: What data consistency model does Amazon S3 employ? Amazon S3 buckets in the US Standard region provide eventual consistency. Amazon S3 buckets in all other regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES."
    Is this a mistaken claim?

    We have our buckets outside US Standard region.

    If we ran a flow of source -> create object in S3 in a landing area -> read from that -> create object in a storage area -> read from that… are you saying this would result in data inconsistency?
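    One catch in that flow is not the GET path but the metadata operations: even in regions with read-after-write consistency for new objects, LIST results (which Hadoop uses to enumerate a "directory" of input files) could lag behind writes, so a stage that discovers its input by listing may silently miss freshly written objects. A toy illustration of that distinction, in pure Python with invented names:

    ```python
    class EventuallyConsistentStore:
        """Toy model of classic S3 semantics: GET of a new key is
        consistent, but LIST results lag behind writes."""

        def __init__(self, list_lag=1):
            self.objects = {}          # immediately visible to get()
            self.listed = set()        # keys visible to list()
            self.pending = []          # [ticks-remaining, key] pairs
            self.list_lag = list_lag   # ticks before a key is listable

        def put(self, key, value):
            self.objects[key] = value
            self.pending.append([self.list_lag, key])

        def get(self, key):
            # read-after-write consistency for new objects
            return self.objects[key]

        def list(self):
            return sorted(self.listed)

        def tick(self):
            # background propagation of listing metadata
            for entry in self.pending:
                entry[0] -= 1
            self.listed.update(k for lag, k in self.pending if lag <= 0)
            self.pending = [e for e in self.pending if e[0] > 0]


    store = EventuallyConsistentStore(list_lag=1)
    store.put("landing/part-00000", b"rows")

    # GET of the new object succeeds right away...
    assert store.get("landing/part-00000") == b"rows"

    # ...but a stage that discovers input by listing sees nothing yet,
    # which is how records get silently dropped mid-pipeline.
    assert store.list() == []

    store.tick()   # listing metadata eventually catches up
    assert store.list() == ["landing/part-00000"]
    ```

    (This is why tools like Netflix's S3mper kept a separate consistent view of the listing metadata.)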

The forum ‘Ambari’ is closed to new topics and replies.
