How Facebook uses Hadoop and Hive

Social media giant Facebook is one of the biggest champions of Hadoop and big data, and it claims to operate the largest single Hadoop Distributed File System (HDFS) cluster anywhere, with more than 100 petabytes of disk space in a single system as of July 2012. The site stores more than 250 billion photos, with 350 million new ones uploaded every day, Jay Parikh, the company's vice president of infrastructure, told InformationWeek in a recent interview. He explained that the social network relies on a number of tools – among them Hadoop, Hive and HBase – to manage its user information and effectively run its business.

According to Parikh, Hadoop is used in every Facebook product and in a variety of ways. User actions such as a "like" or a status update are stored in a highly distributed, customized MySQL database, but applications such as Facebook Messaging run on top of HBase, Hadoop's NoSQL database framework. All messages sent on desktop or mobile are persisted to HBase. Additionally, the company uses Hadoop and Hive to generate reports for third-party developers and advertisers who need to track the success of their applications or campaigns.
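Parikh does not describe the queries themselves, but the advertiser and developer reports he mentions boil down to aggregations over event logs. A minimal sketch of that kind of report, in plain Python rather than HiveQL (the event schema, campaign names and field names here are hypothetical, and Facebook's real reports run as Hive queries over HDFS, not in-process code):

```python
from collections import defaultdict

# Hypothetical ad-impression events of the sort a log table on HDFS
# might hold. The schema (campaign_id, clicked) is illustrative only.
events = [
    {"campaign_id": "spring_sale", "clicked": True},
    {"campaign_id": "spring_sale", "clicked": False},
    {"campaign_id": "spring_sale", "clicked": True},
    {"campaign_id": "new_app", "clicked": False},
    {"campaign_id": "new_app", "clicked": True},
]

def campaign_report(events):
    """Aggregate impressions, clicks and click-through rate per campaign,
    the way a GROUP BY query over an event-log table would."""
    stats = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for e in events:
        s = stats[e["campaign_id"]]
        s["impressions"] += 1
        s["clicks"] += int(e["clicked"])
    for s in stats.values():
        s["ctr"] = s["clicks"] / s["impressions"]
    return dict(stats)

# Roughly equivalent HiveQL (table and column names are assumed):
#   SELECT campaign_id,
#          COUNT(*)              AS impressions,
#          SUM(IF(clicked, 1, 0)) AS clicks
#   FROM   ad_events
#   GROUP  BY campaign_id;

report = campaign_report(events)
```

The point of the sketch is only the shape of the computation: a scan over raw events, grouped by a key, reduced to a handful of per-campaign metrics – exactly the workload Hive's SQL-like interface makes easy to express at scale.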

"All of those analytics are driven off of Hadoop, HDFS, Hive and interfaces that we've developed for developers, internal data scientists, product managers and external advertisers," Parikh said.

Creating faster queries
Hive, the data warehousing infrastructure Facebook helped develop to run on top of Hadoop, is central to meeting the company's reporting needs. Facebook must balance the need for rapid results in features such as its graph tools against simplicity and ease of reporting, so it is working on another contribution to Hive that will improve query speed. That speed matters because the scalability that makes Hive central to the social network's needs can come at the expense of low latency.

"Hive is still a workhorse and it will remain the workhorse for a long time because it's easy and it scales," Parikh said. "Easy is the key thing when you want lots of people to be able to engage with a tool. Hive is very simple to use, so we've been focused on performance to make it even more effective."

For companies just embarking on a big data initiative, striking a balance between handling the technology challenges of Hadoop and deriving insight from data will be difficult but important, Parikh said. Businesses will need to experiment and maintain a constant focus on long-term goals to ensure they build out the technology in the right way. However, with constant innovation in the open source Apache Hadoop community, businesses have more resources than ever to make data central to their operations, just as the social media giants have.
