June 29, 2017

Using S3Guard for Amazon S3 Consistency

This blog has contributions from Mingliang Liu and Rajesh Balamohan.

Late last year, we provided a brief history of Apache Hadoop support for Amazon S3. Our first focus was speeding up the read of S3-hosted data acting as a query input. That was followed by the write pipeline, as well as scaling and security. One big item remains: using S3 as the direct destination of work. Because the API gives S3 the appearance of a filesystem, people try to use it instead of HDFS as the destination of Hive, Spark and MapReduce queries. This appears to work, albeit slowly, but it is insidiously dangerous, because S3’s eventual consistency model does not meet the consistency requirements of a filesystem. In this blog, we’ll cover S3Guard and how it helps eliminate this consistency problem.

The Problem – S3 Consistency

Today, using Amazon S3 as the direct output filesystem for queries is dangerous, as the eventually-consistent nature of the store means that invalid results may be generated. This is more likely with larger datasets and longer queries — the kind used in production, rather than development.

Using Amazon S3 as a source of data, or as a destination for final results copied up via the DistCp tool, is generally fine, as is processing asynchronously collected data. But not, yet, as the direct destination of work. For now, our stance is simply: “don’t”. Instead, work with HDFS and copy the data up afterwards.


To understand this, let’s look at the Amazon S3 consistency model. According to its official documentation, for all regions, S3 provides read-after-write consistency for uploads of new objects (with caveats); it offers eventual consistency for overwritten/updated objects and for object deletion. For all operations, there may be a lag between a change in the state of an object and the values returned by listing operations.

This results in three very significant differences between the consistency of a filesystem and that of S3:

  1. List Inconsistency. When listing “directories”, there can be a lag in the appearance of new objects, the disappearance of deleted objects and the changed state of updated objects. This can lead to silent data loss. A Hive query scanning all data in a source path may miss newly written data or pick up old, deleted data. With bad inputs come bad outputs: the results of the query may be wrong, yet this is unlikely to be noticed. It is particularly problematic when “committing” the intermediate results of operations. Here, individual tasks write their output into private directories; that output is then moved into the final destination by listing those directories and renaming all the files in them. If the listing is not up to date, work is not included in the final results.
  2. Delete Inconsistency. Delete operations may not be visible to end users immediately; for a short period of time, readers may still see the deleted contents. This can cause operations to fail when a check that a destination path does not exist falsely concludes that it is still present.
  3. Update Inconsistency. It can take some time for the updated contents of a file to become visible. If a file is overwritten or updated, readers may continue to see the old data for some time.
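The list-inconsistency failure mode can be sketched with a toy model. This is a deliberate simplification, not the real S3 protocol: writes here are read-after-write consistent, but LIST results lag behind, so a rename-based commit that lists a task directory too early silently misses files.

```python
# Toy model of an eventually consistent object store: PUTs land in the
# backing store immediately, but LIST results lag behind by a fixed delay.
class EventuallyConsistentStore:
    def __init__(self, list_lag=1):
        self.objects = {}      # key -> data (read-after-write consistent)
        self.listing = set()   # keys visible to LIST (lags behind)
        self.pending = []      # [ticks_remaining, key] awaiting visibility
        self.list_lag = list_lag

    def put(self, key, data):
        self.objects[key] = data
        self.pending.append([self.list_lag, key])

    def tick(self):
        # Advance "time": pending keys eventually show up in listings.
        for entry in self.pending:
            entry[0] -= 1
        self.listing |= {key for ticks, key in self.pending if ticks <= 0}
        self.pending = [e for e in self.pending if e[0] > 0]

    def list_keys(self, prefix):
        return sorted(k for k in self.listing if k.startswith(prefix))

store = EventuallyConsistentStore(list_lag=1)
store.put("job/_temporary/task0/part-0000", b"rows")
store.put("job/_temporary/task1/part-0001", b"rows")

# A commit that lists the task directory immediately after the writes
# misses both files, silently dropping their work from the final results:
committed_early = store.list_keys("job/_temporary/")  # -> []

store.tick()  # later, the listing catches up
committed_late = store.list_keys("job/_temporary/")   # -> both files
```

In the real system the lag is unbounded and nondeterministic, which is why the failures are rare in development but show up under production load.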

Some real-world use cases that can be affected by the S3 eventual consistency model are:

  1. Listing Files. Newly created files might not be visible for data processing. In Hive, Spark and MapReduce, this can lead to erroneous results from incomplete source data or failure to commit all intermediate results.
  2. ETL Workflow. Systems like Oozie rely on marker files to trigger the subsequent workflows. Any delay in the visibility of these files can lead to delays in the subsequent workflows.
  3. Existence-guarded path operations. Any action which fails if the destination path is present may see a deleted file in a listing, and so fail — even though the file has already been deleted.

List inconsistency is the most important issue we would like to solve, as it can lead to data loss in production systems. There have been attempts to fix this: Netflix’s S3mper was the original, and Amazon EMR re-implemented something similar. Both store the filesystem tree information in a secondary index store and advise the filesystem implementation (mainly NativeS3FileSystem) by means of aspect-oriented programming to check that index store for consistency. However, there has never been an equivalent in the open source S3 clients in Apache Hadoop.

The Solution – S3Guard

That has changed with the S3Guard extension to the Hadoop S3A FileSystem, in which a metadata store can be used either as a high-performance cache of directory information, or as a complete, high-performance reference “store of record” for all the S3 directory information. Currently, S3Guard uses Amazon DynamoDB as the metadata store because of its low latency, high availability and seamless scalability. More importantly, because users are already using the Amazon S3 web service, it makes perfect sense for them to use another fully-managed web service like DynamoDB instead of maintaining a secondary metadata store themselves.

[Figure: the S3Guard extension to the Hadoop S3A FileSystem]

As indicated in the above figure, Hadoop applications use the S3A filesystem client, whose file reads and writes we accelerated in Hortonworks Data Platform 2.6 through better support for random IO in reads and faster uploads. Now the client has the transparent S3Guard extension enabled. Any write operation that mutates the filesystem tree, such as file creation or deletion, first goes to S3 to persist the objects, and then updates the DynamoDB metadata store accordingly. Read operations continue to return results sourced from S3 as the system of record, but those operations first check their results against the metadata in the consistent store. By uniting the results from both S3 and DynamoDB, listing operations (e.g. listStatus, listLocatedStatus and listFiles) have the latest view of the filesystem tree. Overall, S3Guard enables S3 to be used as the intermediate store of queries and the direct destination of output, with a consistent model.
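The “uniting” of the two views can be sketched as follows. This is an illustration of the idea, not the actual S3A code: consistent metadata records are overlaid on the raw S3 listing, adding files the listing has not caught up with and filtering out entries the metadata store knows were deleted.

```python
def guarded_list(s3_listing, metadata):
    """Sketch of a guarded listing: overlay consistent metadata records
    on an eventually consistent raw S3 listing."""
    visible = set(s3_listing)
    for key, record in metadata.items():
        if record == "deleted":
            visible.discard(key)  # tombstone: hide entries S3 still lists
        else:
            visible.add(key)      # surface files S3 has not listed yet
    return sorted(visible)

# The raw S3 listing lags behind: "new.csv" was just created and is not
# listed yet, while "old.csv" was just deleted but still appears.
merged = guarded_list(["a.csv", "old.csv"],
                      {"new.csv": "file", "old.csv": "deleted"})
# -> ["a.csv", "new.csv"]
```

The real implementation also tracks directory entries and modification times, but the core idea is this overlay of an authoritative record on an eventually consistent listing.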

Meanwhile, with metadata cached in DynamoDB, S3Guard reduces the number of calls to S3 and improves the performance of listing files — a core operation in the “split calculation” at the start of queries. This speedup means S3Guard has tangible benefits when executing queries, alongside the less visible but more critical prevention of inconsistent listings. Our early performance benchmarking shows that S3Guard can cut split computation time in half on datasets involving a large number of partitions. S3Guard doesn’t just deliver consistency — it delivers speed.

Hive Query Performance with S3Guard

This S3Guard feature is available as a technical preview in Hortonworks Data Platform 2.6. It is entirely backwards compatible with the existing S3A connector, as it does not change how files and directories are stored in the S3 backend. As the DynamoDB service uses the same authentication mechanisms as S3, users do not need any dedicated authentication configuration. Moreover, the only new dependency is the DynamoDB client, which is already included in the AWS SDK bundle, so you do not need any new module dependencies to use this feature in your applications. However, you will need to create a DynamoDB table and provision it for your desired IO rate; this is something Amazon will bill you for.

The Details

By default, S3Guard is disabled. To enable S3Guard for all S3 buckets, set fs.s3a.metadatastore.impl to org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore in core-site.xml.
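As a minimal sketch, the core-site.xml entry looks like this (property name and value as documented for the S3Guard technical preview):

```xml
<!-- core-site.xml: enable S3Guard with DynamoDB for all S3A buckets -->
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
```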

Additionally, you can specify the DynamoDB table name via the fs.s3a.s3guard.ddb.table option, without which S3Guard will use the S3 bucket name by default. You can also set the DynamoDB region via fs.s3a.s3guard.ddb.region, without which S3Guard will use the bucket’s own S3 region for locality. Alternatively, you can enable S3Guard only for specific S3 buckets: all of the above config keys can be set on a per-bucket basis. For example, to enable S3Guard only for the bucket myawsbucket, set fs.s3a.bucket.myawsbucket.metadatastore.impl to org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.
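For instance, a per-bucket setup for the bucket myawsbucket might look like the following sketch (the table name my-s3guard-table is a hypothetical example, not a required value):

```xml
<!-- Enable S3Guard for one bucket only, with an explicit table name -->
<property>
  <name>fs.s3a.bucket.myawsbucket.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <name>fs.s3a.bucket.myawsbucket.s3guard.ddb.table</name>
  <value>my-s3guard-table</value><!-- hypothetical table name -->
</property>
```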

There is also an S3Guard command-line tool for managing the metadata store.

For example, to pre-populate a metadata store according to the current contents of an S3 bucket, run:

$ hadoop s3guard import [-meta URI] s3a://BUCKET

You can view the design documentation and follow the development status at Apache JIRA HADOOP-13345. In our ongoing effort, we want to enhance delete consistency, implement retry policies for dealing with disagreements between S3 and the metadata store, and further improve list performance when a subtree of the filesystem is fully tracked in the metadata store. In addition, we are working on the next big step: a zero-rename committer for S3 built on S3Guard.

Thank You and Learn More

All this work benefits from the collective efforts of the Apache Hadoop community. Hortonworks engineers from different teams (HDFS, Spark and Hive) participated actively in the design and implementation. We would like to thank everyone in the community who contributed to this project. In particular, we are grateful to Ram Venkatesh, Sanjay Radia, Jitendra Pandey, and Jeffrey Sposetti from Hortonworks; Aaron Fabbri, Sean Mackrory and Lei (Eddy) Xu from Cloudera; Thomas Demoor from Western Digital; Ai Deng; and Chris Nauroth, who started this work. Finally, special praise for our colleagues in the QE Team at Hortonworks. They’re the ones who make sure this doesn’t break anything!

Resource Link

Hortonworks Data Platform 2.6

Product Documentation
Cloud Data Access Guide

Configuring S3Guard

(Technical Preview)



Gaurav Shah says:

Thanks for the wonderful article Steve.
Is this part of Hadoop 2.8.x or 2.7.x ?

Steve Loughran says:

S3Guard itself is just coming together to be merged into Hadoop trunk (3.0) and backported to 2.9. We’ve been shipping a preview of it in HDC, our HDP-on-cloud product.

Apache Hadoop 2.8 has many S3A performance improvements over Hadoop 2.7, for reading and writing data. If you are working with S3: use it. But you cannot safely write work directly to S3 with it, because the commit algorithm needs to be able to list all the files to commit, for which it must have a consistent filesystem.


Brian Kramer says:

I’m interested in learning whether there is a standalone proxy of some kind, preferably with a C++ implementation, that offers the same functionality as S3Guard. We have a C++ client that uses direct REST calls, and during my exploring whether we should switch to DynamoDB, I came across this concept.

Joel Stenquist says:

Link for “Using S3Guard for Consistent S3 Metadata” doesn’t work; here is the new one –
