One of the great things about working in open source development is working with other experts round the world on big projects, and then having the results of that work in the hands of users within a short period of time.
This is why I’m really excited about the Rackspace announcement of their HDP-based Big Data offerings, both “on-prem” and in the cloud. Not just because it’s a partner of ours offering a service based on Hadoop, but because it shows that Hadoop integration with OpenStack has reached the point where it’s ready for production use. And because, to get to that state, the Hadoop and OpenStack development communities had to work together.
I can only vouch for one part of the collaboration: the OpenStack Swift FileSystem client for Apache Hadoop. As customers run their applications in the public cloud, object stores like Swift are being used for data backup and long-term archival. This new client allows Hadoop applications, including MapReduce, Pig and Hive queries, to work with data stored in a Swift object store, for both reading and writing.
With a new filesystem URL, swift://, Hadoop code deployed in an OpenStack cluster can work with persistent data in the Swift store. The Hadoop VMs in the cluster use their virtual disks for HDFS storage, gaining the performance offered by local access to files divided into blocks. The applications just need to write the final data back to Swift before destroying the cluster. As a result, customers can avoid the compute-plus-storage cost of maintaining a long-running Hadoop cluster for analytics in the public cloud.
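To make the URL scheme a little more concrete, here is a minimal Python sketch of how a swift:// path might be taken apart. It assumes a `swift://<container>.<service>/<path>` layout, which is my recollection of the convention rather than something spelled out in this post; the names `SwiftLocation` and `parse_swift_url`, the container `data` and the service `rackspace` are purely illustrative and are not part of the Hadoop code.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class SwiftLocation(NamedTuple):
    """Illustrative holder for the pieces of a swift:// URL."""
    container: str  # Swift container holding the objects
    service: str    # name of the configured Swift service (assumed layout)
    path: str       # object path inside the container


def parse_swift_url(url: str) -> SwiftLocation:
    """Split an assumed swift://<container>.<service>/<path> URL."""
    parsed = urlparse(url)
    if parsed.scheme != "swift":
        raise ValueError(f"not a swift:// URL: {url!r}")
    # Assumption: the hostname is "<container>.<service>".
    container, dot, service = parsed.netloc.partition(".")
    if not dot or not container or not service:
        raise ValueError(f"expected <container>.<service> in {url!r}")
    return SwiftLocation(container, service, parsed.path or "/")


if __name__ == "__main__":
    print(parse_swift_url("swift://data.rackspace/logs/2013/part-0000"))
```

The point of the sketch is only that the URL carries both where the data lives (the container) and which set of endpoint settings to use (the service), which is what lets the same job address more than one store.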
There’s one other thing this swift:// filesystem can do: it lets you work with remote object stores, in-house or external, using different login credentials. That lets you run Hadoop jobs which pull in data from a remote Swift store, work on it in Hadoop, then write it to the local Swift store. It also lets Hadoop clusters that aren’t running in OpenStack work with OpenStack Swift object stores, such as the Rackspace cloud storage services round the world.
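As a rough illustration of what "different login credentials" could look like in practice, the Python sketch below builds one set of configuration properties per named Swift service. The `fs.swift.service.<name>.<key>` property pattern and the keys shown are assumptions based on my memory of the client's configuration, so check them against the client's own documentation; the service names `remote` and `local`, the URLs and the usernames are all made up.

```python
from typing import Dict


def swift_service_properties(services: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Flatten per-service settings into Hadoop-style property names.

    Assumption: properties follow "fs.swift.service.<name>.<key>".
    """
    props: Dict[str, str] = {}
    for name, settings in services.items():
        for key, value in settings.items():
            props[f"fs.swift.service.{name}.{key}"] = value
    return props


# Hypothetical endpoints and accounts, one per Swift store.
EXAMPLE_SERVICES = {
    "remote": {
        "auth.url": "https://auth.remote.example.com/v2.0/tokens",
        "username": "analyst",
    },
    "local": {
        "auth.url": "https://auth.local.example.com/v2.0/tokens",
        "username": "hadoop",
    },
}

if __name__ == "__main__":
    for prop, value in sorted(swift_service_properties(EXAMPLE_SERVICES).items()):
        print(f"{prop} = {value}")
```

With two services defined this way, a job could in principle read from a URL that names the `remote` service and write to one that names `local`, each authenticated separately.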
This Swift FileSystem client is a critical feature of OpenStack-hosted Hadoop services, no matter whose OpenStack cluster it is, or whose Hadoop-based data stack is running in the cluster: vanilla Apache Hadoop, Hortonworks Data Platform, or other offerings. We look forward to this, because we put a lot of effort into this integration, in the form of the HADOOP-8545 feature. This began as a development between Hortonworks and Rackspace, and was soon joined by Mirantis. Together we developed the code to talk to OpenStack, wrote the tests to show it worked, ran those tests against public and private Swift filesystems, and debugged everything when something didn’t work.
This is where the cross-community development paid off: when things played up. One example is Hadoop’s expectation of what block size a filesystem should report, even when that filesystem doesn’t support blocks. Another is that some public Swift filesystems throttle operations to prevent denial-of-service attacks, and deleting large directory trees can trigger this behavior. Finally, there’s performance: how to get data back fast, while scaling up well. We at Hortonworks knew the Hadoop side of the problem, but we depended on the skills of Rackspace and Mirantis for understanding OpenStack, and together we identified and fixed things. They also set up their test environments to verify that everything worked with future releases of OpenStack and Rackspace cloud, while we did the same with Hadoop’s evolving code.
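To give a feel for the throttling problem, here is a small Python sketch of one common way to cope with it: retrying each delete with exponential backoff. This is not a description of how the Hadoop client actually handles throttling; the `Throttled` exception, the `delete_one` callback and the retry parameters are all hypothetical stand-ins for whatever a real object store client would expose.

```python
import time
from typing import Callable, Iterable


class Throttled(Exception):
    """Hypothetical error raised when the store rejects a request as too frequent."""


def delete_tree(
    paths: Iterable[str],
    delete_one: Callable[[str], None],
    max_retries: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete every path, backing off exponentially whenever the store throttles.

    Returns the number of paths deleted; re-raises Throttled if one path
    is still being throttled after max_retries retries.
    """
    deleted = 0
    for path in paths:
        delay = base_delay
        for attempt in range(max_retries + 1):
            try:
                delete_one(path)
                deleted += 1
                break
            except Throttled:
                if attempt == max_retries:
                    raise
                sleep(delay)  # wait before retrying the same path
                delay *= 2    # double the wait after each throttled attempt
    return deleted
```

The `sleep` parameter is injectable so the behavior can be exercised in a test without actually waiting. A large directory tree turns into many individual object deletes, which is why this kind of operation is the one most likely to run into a rate limit.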
That’s why the Rackspace product is so exciting – it doesn’t just represent HDP-as-a-service, it represents the culmination of collaboration across different open source communities, to produce a system that benefits everyone.
And it doesn’t stop there. We’re still working on Hadoop/OpenStack integration as part of the Savanna Project, a project to make it easier for everyone to “spin up” Hadoop clusters. We’re looking forward to great things there, and everyone is excited about what Hadoop 2 and HDP-2 are going to let people do in this world.
Learn more about Rackspace Big Data Cloud Platform with Hortonworks here.