I’m afraid you can’t swap out HDFS for Amazon S3. S3 is an eventually consistent object store with directory operations simulated in the Hadoop client; while that client pretends S3 is a filesystem, it isn’t really one, and things will break. Nobody does this, and the fact that a bit of the Hadoop wiki says you can is an error that someone should have corrected a long time ago. Ambari is just picking up on the problem early, since it assumes the filesystem is HDFS and tries to run some startup operations against it, but even without that, things would go wrong later.
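To make the failure mode concrete, here is a minimal sketch against the Hadoop FileSystem API; the bucket and paths are made up. On HDFS, rename() is a single atomic metadata operation that job committers depend on. Against S3 the client has to emulate it by copying every object under the source prefix and then deleting the originals, which is slow, non-atomic, and, given eventual consistency, can leave listings showing partial or stale results:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3RenameDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical bucket; the client is what Hadoop ships for S3.
        FileSystem fs = FileSystem.get(URI.create("s3n://my-bucket/"), conf);

        // On HDFS this is one atomic metadata operation. Against S3 the
        // client copies each object under the source prefix and deletes
        // the originals, so a failure or a concurrent listing can observe
        // a half-renamed directory.
        fs.rename(new Path("s3n://my-bucket/tmp/job_attempt_0"),
                  new Path("s3n://my-bucket/output"));
      }
    }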
I have just updated the Apache wiki page to explain why you can’t do this. If you want more details, and really, really want to try getting it to work, look at what Netflix had to do with S3mper and consider whether it is worth it. It is for them, because they run many dynamic Hadoop clusters, direct the output of every Hadoop sequence to S3, and then direct queries to whichever of those clusters has capacity. If you have one cluster, you don’t need to do this.
What you can do instead is run HDFS on your EC2 hosts and use S3 as the place to read input data from and to write results to at the end of a workflow. Intermediate data should live in HDFS, as it is faster to access.
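As a rough sketch of that layout, assuming a hypothetical two-stage MapReduce workflow (mapper, reducer and job configuration omitted for brevity; the bucket and path names are made up):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WorkflowPaths {
      public static void main(String[] args) throws Exception {
        // Stage 1: read the source data from S3, keep the intermediate
        // output in HDFS where it is fast to re-read.
        Job stage1 = Job.getInstance();
        FileInputFormat.addInputPath(stage1, new Path("s3n://my-bucket/input"));
        FileOutputFormat.setOutputPath(stage1, new Path("hdfs:///tmp/stage1"));
        stage1.waitForCompletion(true);

        // Stage 2: read the intermediate data from HDFS and write the
        // final results back to S3 only at the end of the workflow.
        Job stage2 = Job.getInstance();
        FileInputFormat.addInputPath(stage2, new Path("hdfs:///tmp/stage1"));
        FileOutputFormat.setOutputPath(stage2, new Path("s3n://my-bucket/results"));
        stage2.waitForCompletion(true);
      }
    }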