This blog post has contributions from Mingliang Liu and Rajesh Balamohan. Late last year, we provided a brief history of Apache Hadoop support for Amazon S3. Our first focus was speeding up reads of S3-hosted data used as query input. That was followed by the write pipeline, then scaling and security. One big item remains: using S3 as the direct destination of work. Because the API gives S3 the appearance of a filesystem, people try to use it in place of HDFS as the destination of Hive, Spark and MapReduce queries. This appears to work, albeit slowly, but it is insidiously dangerous: S3's eventual consistency model does not meet the consistency requirements of a filesystem. In this blog, we'll cover S3Guard and how it helps eliminate this consistency problem.
Today, using Amazon S3 as the direct output filesystem for queries is dangerous, as the eventually-consistent nature of the store means that invalid results may be generated. This is more likely with larger datasets and longer queries — the kind used in production, rather than development.
Using Amazon S3 as a destination for copies of completed work (via the DistCp tool), or as a store for asynchronously collected data to be processed later, is fine. But not, yet, as the direct destination of work. Our stance is currently: "don't". Instead, work with HDFS and copy the data up afterwards.
To understand this, let's look at the Amazon S3 consistency model. According to its official documentation, in all regions S3 provides read-after-write consistency for the upload of new objects (with caveats), and eventual consistency for overwritten/updated objects and for object deletion. For all operations, there may be a lag between the change in state of an object and the results returned by listing operations.
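To make the hazard concrete, here is a toy Python model of that behaviour. This is an illustration only, not S3's actual implementation: a GET of a new object succeeds immediately, but listings lag behind, so a freshly written file can be invisible to the next stage of a query.

```python
import time

class EventuallyConsistentStore:
    """Toy model of S3 listing semantics: PUTs are immediately readable
    via GET, but listings only reflect them after a delay."""

    def __init__(self, list_lag_seconds):
        self.objects = {}   # key -> (value, time written)
        self.lag = list_lag_seconds

    def put(self, key, value):
        self.objects[key] = (value, time.time())

    def get(self, key):
        # Read-after-write: a new object is visible to GET at once.
        return self.objects[key][0]

    def list_keys(self):
        # Listings lag: recently written keys may be missing.
        now = time.time()
        return sorted(k for k, (_, t) in self.objects.items()
                      if now - t >= self.lag)

store = EventuallyConsistentStore(list_lag_seconds=60)
store.put("output/part-0000", b"data")
print(store.get("output/part-0000"))  # the object itself is readable...
print(store.list_keys())              # ...but the listing is still empty: []
```

A query committer that enumerates its output directory during this window would silently miss `part-0000`, which is exactly the failure mode described below.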
This results in three very significant differences between the consistency of a filesystem and that of S3:
Some of the real-world use cases that can be impacted by the S3 eventual consistency model are:
List inconsistency is the most important issue we would like to solve, as it can lead to data loss in production systems. There have been attempts to fix it before: Netflix's S3mper was the original, and Amazon EMR later re-implemented something similar. Both store the filesystem tree information in a secondary index store and advise the filesystem implementations (mainly the NativeS3FileSystem) by means of aspect-oriented programming to check that index store for consistency. However, there has never been an equivalent in the open source S3 clients in Apache Hadoop.
That has changed with the S3Guard extension to the Hadoop S3A filesystem, in which a metadata store can be used either as a high-performance cache of directory information, or as a complete, high-performance "store of record" for all the S3 directory information. Currently S3Guard uses Amazon DynamoDB as the metadata store because of its low latency, high availability and seamless scalability. More importantly, since users are already consuming the Amazon S3 web service, it makes sense for them to use another fully-managed web service like DynamoDB rather than maintain a secondary metadata store themselves.
As indicated in the figure above, Hadoop applications use the S3A filesystem client, whose file reads and writes we accelerated in Hortonworks Data Platform 2.6 through better support for random IO in reads and faster uploads. Now the client has the transparent S3Guard extension enabled. Any write operation that mutates the filesystem tree, such as file creation or deletion, first goes to S3 to persist the objects, then updates the DynamoDB metadata store accordingly. Read operations continue to return results sourced from S3 as the system of record, but those results are first checked against the metadata in the consistent store. By uniting the results from both S3 and DynamoDB, listing operations (e.g. listFiles) have the latest view of the filesystem tree. Overall, S3Guard enables S3 to be used as the intermediate store of queries and the direct destination of output with a consistent model.
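The listing merge described above can be sketched in a few lines. The function and the `"exists"`/`"deleted"` states are illustrative, not the actual S3Guard code: the point is that an authoritative record of recent creations and deletions can correct a stale S3 listing.

```python
def merged_listing(s3_listing, metastore_entries):
    """Unite a possibly-stale S3 listing with a consistent metadata store.

    metastore_entries maps key -> "exists" (recently created) or
    "deleted" (a tombstone for a recent delete)."""
    result = set(s3_listing)
    for key, state in metastore_entries.items():
        if state == "exists":
            result.add(key)       # created, but not yet in the S3 listing
        elif state == "deleted":
            result.discard(key)   # deleted, but still in the stale listing
    return sorted(result)

# S3's listing is behind: part-0002 was just written, part-0000 just deleted.
stale = ["output/part-0000", "output/part-0001"]
meta = {"output/part-0002": "exists", "output/part-0000": "deleted"}
print(merged_listing(stale, meta))
# -> ['output/part-0001', 'output/part-0002']
```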
Meanwhile, with metadata cached in DynamoDB, S3Guard reduces the number of calls to S3 and speeds up listing files, a core operation in the "split calculation" at the start of queries. This speedup means that S3Guard has tangible benefits when executing queries, as well as the less visible but more critical prevention of inconsistent listings. Our early performance benchmarking shows that S3Guard can cut split computation time in half on datasets involving a large number of partitions. S3Guard doesn't just deliver consistency; it delivers speed.
This S3Guard feature is available as a technical preview in Hortonworks Data Platform 2.6. It is entirely backwards compatible with the existing S3A connector, as it does not change how files and directories are stored in the S3 backend. As the DynamoDB service uses the same authentication mechanisms as S3, users do not need any dedicated authentication configuration. Moreover, the only new dependency is the DynamoDB client, which is already included in the AWS SDK bundle, so you do not need any new module dependencies to use this feature in your applications. However, you will need to create a DynamoDB table and provision it for your desired IO rate; this is something Amazon will bill you for.
By default S3Guard is disabled. To enable S3Guard for all S3 buckets, set the following config values in core-site.xml.
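For example, the following fragment selects the DynamoDB metadata store implementation globally (as this is a technical preview, check the documentation shipped with your release for the definitive property names):

```xml
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
```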
Additionally, you can specify the DynamoDB table name via the fs.s3a.s3guard.ddb.table config option; without it, S3Guard uses the S3 bucket name by default. You can also set the DynamoDB region via fs.s3a.s3guard.ddb.region; without it, S3Guard uses the same region as the S3 bucket, for locality. Alternatively, you can enable S3Guard only for specific S3 buckets, as all of the above config keys support setting S3A options on a per-bucket basis. For example, to enable S3Guard only for the bucket myawsbucket, set fs.s3a.bucket.myawsbucket.metadatastore.impl to the DynamoDB metadata store implementation.
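A per-bucket configuration might look like the fragment below; the table name shown is illustrative, and the per-bucket key form follows the standard S3A pattern of inserting `bucket.BUCKETNAME` after the `fs.s3a.` prefix:

```xml
<property>
  <name>fs.s3a.bucket.myawsbucket.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <name>fs.s3a.bucket.myawsbucket.s3guard.ddb.table</name>
  <value>my-s3guard-table</value>
</property>
```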
There is also an S3Guard command line tool for managing the metadata store.
For example, to pre-populate a metadata store according to the current contents of an S3 bucket, run:
$ hadoop s3guard import [-meta URI] s3a://BUCKET
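Other subcommands cover table lifecycle and diagnostics. For example (the table name and capacity values here are illustrative; provisioned reads and writes are what DynamoDB bills for):

```shell
# Create the DynamoDB table and provision its IO capacity:
$ hadoop s3guard init -meta dynamodb://my-s3guard-table -write 5 -read 10 s3a://BUCKET

# Compare the metadata store against the live bucket contents:
$ hadoop s3guard diff s3a://BUCKET
```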
You can view the design documentation and follow the development status at Apache JIRA HADOOP-13345. Ongoing work includes enhancing delete consistency, implementing retry policies for handling disagreements between S3 and the metadata store, and further improving list performance when a subtree of the filesystem is fully tracked in the metadata store. In addition, we are working on the next big step change: a zero-rename committer to S3 using S3Guard.
All this work benefits from the collective efforts of the Apache Hadoop community. Hortonworks engineers from different teams (HDFS, Spark and Hive) participated actively in the design and implementation. We would like to thank everyone in the community who contributed to this project. In particular, we are grateful to Ram Venkatesh, Sanjay Radia, Jitendra Pandey, and Jeffrey Sposetti from Hortonworks; Aaron Fabbri, Sean Mackrory and Lei (Eddy) Xu from Cloudera; Thomas Demoor from Western Digital; Ai Deng; and Chris Nauroth, who started this work. Finally, special praise for our colleagues in the QE team at Hortonworks: they're the ones who make sure this doesn't break anything!
Hortonworks Data Platform 2.6
Cloud Data Access Guide: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_cloud-data-access/content/about.html