Hadoop in Perspective: Systems for Scientific Computing
When the term scientific computing comes up in a conversation it’s usually just the occasional science geek who shows signs of recognition. But although most people have little or no knowledge of the field’s existence, it has been around since the second half of the twentieth century and has played an increasingly important role in many technological and scientific developments. Internet search engines, DNA analysis, weather forecasting, seismic analysis, renewable energy, and aircraft modeling are just a small number of examples where scientific computing is nowadays indispensable.
Apache Hadoop is a newcomer in scientific computing, and is welcomed as a great new addition to already existing systems. In this post I mean to give an introduction to systems for scientific computing, and I make an attempt at giving Hadoop a place in this picture. I start by discussing arguably the most important concept in scientific computing: parallel computing; what is it, how does it work, and what tools are available? Then I give an overview of the systems that are available for scientific computing at SURFsara, the Dutch center for academic IT and home to some of the world’s most powerful computing systems. I end with a short discussion on the questions that arise when there are many different systems to choose from.
1. Parallel computing
The basic idea
Let’s start with an example: Imagine you’re asked to perform one thousand calculations, consisting of operations like additions and subtractions, multiplications and divisions. Let’s assume you’re very good at math, you don’t get tired very easily, and that a single calculation would take you 5 seconds – quite a good average speed for a person, I’d say! Despite your impressive computational capabilities, you would need 5000 seconds, more than 83 minutes, to complete all the calculations.
Now let’s say you have cloned yourself one hundred times, so you can let your clones work simultaneously. Now it would take a hundred times less, so 50 seconds, to complete the work!
This is the basic idea of parallel computing:
Reducing the time needed to complete a computation by performing its calculations simultaneously on multiple processors.
Of course in real life we use computers instead of clones to do our calculations, avoiding the more inconvenient implications of cloning.
In the example above we assume that the thousand calculations are completely independent. This is not always the case. Sometimes the result of one calculation is needed as input for another calculation (this is called a fold or reduce operation – sounds familiar?). In such a case you cannot perform the calculations simultaneously. You’ll just have to do them sequentially, one at a time. And your clones will be standing idly by, not being of any use.
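The difference between independent and dependent calculations can be sketched in a few lines of Python. This is only an illustration: the function names and the arithmetic inside them are made up for the example. Threads are used to keep the sketch short; in CPython, CPU-bound work would usually be given to a process pool instead.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def calculate(x):
    """One independent calculation (a stand-in for real work)."""
    return x * x + 1


def run_independent(values, workers=4):
    """Each result depends only on its own input, so the calculations
    can be handed out to several workers (the 'clones') at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns the results in the order of the inputs.
        return list(pool.map(calculate, values))


def run_dependent(values):
    """Each step needs the result of the previous step, so the steps
    have to run one after another, however many workers are idle."""
    return reduce(lambda acc, x: acc * 2 + x, values, 0)
```

Calling `run_independent(range(5))` spreads five calculations over the workers, while `run_dependent([1, 2, 3])` has to walk through its inputs strictly in order.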
Just a short note on the possible speed-up of processing time when using parallel computing. Completely dependent and completely independent calculations are opposite ends of the spectrum. Often, in computational science, a problem will consist of one or more dependent parts that need to run sequentially, and parts that can be run in parallel. In such cases the maximum speed-up gained by applying parallel computing is less intuitive. If you’re interested in understanding the theoretical speed-up gained by using multiple processors, then have a look at Amdahl’s law.
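For the curious, Amdahl’s law fits in a few lines. The sketch below assumes the usual textbook form, speed-up = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of processors. The function and parameter names are mine.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Theoretical speed-up according to Amdahl's law.

    parallel_fraction: share of the work that can run in parallel (0.0 to 1.0)
    processors: number of processors working simultaneously (1 or more)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)
```

Applied to the clone example: with completely independent calculations, `amdahl_speedup(1.0, 100)` gives a speed-up of 100, which turns 5000 seconds into 50. If only 90% of the work can run in parallel, `amdahl_speedup(0.9, 100)` drops to roughly 9.2, so the same hundred clones would still need about 545 seconds.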
Tools and systems for parallel computing
But parallel computing is hard. My fellow countryman and 1972 Turing Award winner Edsger Dijkstra described and formalized many of the problems common in parallel computing, such as the sleeping barber problem, the dining philosophers problem, and the cigarette smokers problem. Synchronization, communication, and hard- and software failures are among the challenges that have to be dealt with. Luckily as a programmer you can stand on the shoulders of giants and use compilers, libraries, Application Programming Interfaces, and programming models that, often to different extents, deal with problems posed by parallel computing. Examples are Erlang, OpenMP, the Message Passing Interface, and recently MapReduce and other novelties like Apache Hama and Apache Drill, the former inspired by the Bulk Synchronous Parallel (BSP) model and the latter based on Google’s Dremel.
Which of these you use when you work on a parallel computing problem depends on different considerations. Such considerations might have theoretical drivers (e.g. my simulation inherently needs complex synchronization patterns, so I use MPI), practical drivers (e.g. I don’t know what I’m looking for and I’ll have to adjust and repeat my experiments indefinitely, so agility and time-to-solution are most important for me, that’s why I’m using MapReduce and Pig), personal drivers (e.g. I’m a computer scientist and in my opinion all implementations of BSP are ugly, so I’m going to create my own), and any combination of the three. There’s no rule of thumb when choosing a tool for your problem. I’ll touch on that again in the discussion.
And this is analogous to choosing what physical system to use underneath your tool. Theoretical drivers (e.g. my simulation needs a lot of communication, so I’ll need very big interconnects), practical drivers (e.g. I only have access to the SURFsara Hadoop system so I’ll just force my experiment to fit on the MapReduce model), personal drivers (e.g. I’m a computer scientist and I want to develop a new architecture for data-intensive workloads (like this research project for example)), and any combination of the three influence the eventual choice.
2. Scientific computing at SURFsara
SURFsara is the Dutch national center for academic IT. We are sometimes called a supercomputing center, but that’s just partly true; we do provide supercomputing, but we also provide other compute, data, network, and visualization services, and we offer consultancy so our clients understand how to apply these services for solving their (mostly) scientific problems. Here I’ll just focus on our computational services.
At SURFsara we have 6 different compute systems:
- the Dutch national supercomputer Huygens (to be replaced this year)
- the national compute cluster Lisa
- a GPU cluster (part of Lisa, I won’t go into this)
- an HPC cloud system called Calligo (won’t go into this either)
- the grid
- and a Hadoop cluster called Alley.
The giant, the ants, and the elephant: Supercomputing vs clustercomputing vs gridcomputing vs Hadoop
Huygens is an IBM pSeries 575 system with a total of 3456 IBM Power6 cores, running at 4.7GHz. It has 15.75TB of memory and is tied together with a 160Gbit per second InfiniBand inter-node interconnect. The system includes a Storage Area Network (SAN) with 3240 fiber channel disks with 300GB capacity, running at 15000 rpm in different configurations. 20 racks are used for this storage, which amounts to a total of 972TB of raw capacity (resulting in ~700TB net storage).
Huygens is a well-balanced system. It doesn’t just provide performance in terms of what we call FLOPS (FLoating-point Operations Per Second), an important metric in the top 500 of supercomputers, but it also provides very fast inter-node interconnects and low-latency, high-bandwidth access to data. This makes it suitable for a wider range of scientific applications, rather than a much narrower range of applications in need of an even higher amount of FLOPS.
But Huygens is still a very high performing, and thus expensive system. It is suboptimal to have a user who cannot spend much time optimizing his or her code, and who has no need for high performance access to data or fast interconnects, running the code on Huygens. So we also provide access to the national compute cluster Lisa.
Lisa is meant for users who do not need the capabilities of Huygens, but do need a system with a large capacity of compute power. It has a total of 6528 cores, most running at 2.26GHz, and provides a total of 16TB of memory. A portion has an inter-node interconnect of 12.8Gbit, while for other nodes the speed of the interconnects is slower. Lisa includes 100TB Network Attached Storage (NAS).
Lisa is very popular among our users. It provides a lot of capacity in terms of cores, and still comes with the capability of communication between nodes. This makes it suitable for large scale independent calculations, but also for fairly well performing large scale dependent calculations. It is, however, not meant for large scale data processing, as you can see from the size of the NAS.
The grid, however, is much more suitable for doing large-scale data processing. The term High Throughput Computing (HTC) describes this characteristic and contrasts this type of computing with High Performance Computing (HPC), a term more suitable for Huygens and Lisa.
The grid is a system that provides even fewer capabilities than Lisa, but a lot more capacity in terms of computing and storage. The grid consists of a number of interconnected clusters, similar to Lisa in setup, but without the strong interconnects between nodes and with significantly more Network Attached Storage. Within a grid the clusters are geographically distributed and are often operated by different organizations. A grid can cover multiple continents. The European Grid Infrastructure (EGI) is an example of such a world-wide grid. The Dutch part of EGI consists of about 15 clusters distributed throughout the Netherlands, and provides a total of around 10PB of disk storage (through locally attached NAS systems), 9000 cores, and 20PB of tape storage.
The system is used to run processes with many independent calculations simultaneously, while each process uses a single core and doesn’t communicate with other processes. Compared to Huygens and even Lisa, grids handle the simplest workloads. One note: simple in terms of pure parallel computing – each process may be complex to any extent, and so might the orchestration be, in terms of dealing with fault tolerance and pipelining of processes. Grids provide by far the most capacity in terms of number of cores and amount of storage, especially for those who have access to a world-wide grid.
However, the software stack (or more correctly: grid middleware) we use to provision the grid in the Netherlands lacks a good abstraction of the grid as a whole. Granted, as a user you are able to submit a job (a set of calculations) to the grid, not knowing on what cluster it will land. But knowing where it will land is important if you use the grid for its high throughput capabilities; transferring large amounts of data from a remote NAS is suboptimal at best. Additionally, the grid does not natively provide fault tolerance. Users need to keep an account of all their jobs and check whether they have completed successfully, resubmitting them manually if they failed. Efforts to deal with these limitations have been ongoing for years, but do not get much further than the labs working on them. Most users deal with data locality and error handling themselves, through custom scripts or manually.
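To give an impression of the bookkeeping this leaves to the user, here is a minimal sketch of such a custom script in Python. It does not talk to any real grid middleware: `run_job` is a hypothetical callable that submits one job, waits for it, and returns True on success. All names and the retry limit are assumptions made for the example.

```python
def run_with_resubmission(job_ids, run_job, max_attempts=3):
    """Keep an account of every job and resubmit the ones that failed.

    run_job is a hypothetical stand-in for 'submit to the grid and wait';
    it takes a job id and returns True if the job completed successfully.
    """
    attempts = {job_id: 0 for job_id in job_ids}
    pending = list(job_ids)
    completed = []

    while pending:
        still_failing = []
        for job_id in pending:
            attempts[job_id] += 1
            if run_job(job_id):
                completed.append(job_id)
            elif attempts[job_id] < max_attempts:
                # Failed, but we have attempts left: try again next round.
                still_failing.append(job_id)
        pending = still_failing

    # Jobs that never succeeded within max_attempts.
    gave_up = [job_id for job_id in job_ids if job_id not in completed]
    return completed, gave_up, attempts
```

Even this toy version ignores data locality entirely, which is the other half of what grid users end up handling themselves.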
So then there’s Alley, our Hadoop cluster. Alley is a system that is optimized for capacity as well. It consists of 72 nodes, of which 66 are used as DataNodes and TaskTrackers. These provide a total of 66 nodes * 8 cores = 528 cores and 66 nodes * 4 disks * 2 TB = 528TB. Each node has 64GB of RAM, and a bonded 2Gb network connection. The cluster has a 10Gbit uplink to the Dutch university network. Since Hadoop includes HDFS there’s no NAS or SAN involved – a great way of saving money, this is the cheapest system we have! We’re currently in the process of the first of two extensions this year.
Hadoop is optimized for big data processing (or: High Throughput Computing), and provides the HDFS and MapReduce abstractions exactly for that purpose. You don’t need to worry about the low-level implications of parallel computing, about data-locality, or about error handling. The framework handles all of that for you, the only thing you need to do is to stay within the boundaries as dictated by HDFS and MapReduce. That is, no random reads, high latency (but high throughput), and a single synchronization barrier. In other words, a strict model that makes parallel computing easy.
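To make the shape of that model concrete, here is a toy word count in plain Python. It is a single-machine sketch of the MapReduce idea, not the Hadoop API: the function names are made up, and everything Hadoop actually adds (distribution, data locality, fault tolerance) is left out.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record independently and collect
    the (key, value) pairs it emits."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """The single synchronization barrier: all map output is grouped
    by key before any reducer runs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(word, counts):
    """Sum the counts collected for one word."""
    return sum(counts)


def word_count(lines):
    pairs = map_phase(lines, word_mapper)
    return reduce_phase(shuffle(pairs), count_reducer)
```

The mapper sees one record at a time and the reducer sees one key at a time, so neither can depend on anything else. That restriction is what lets a framework run them anywhere, in any order, and rerun them when a node fails.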
3. Discussion: Jimmy’s hammer
A recurring question in scientific computing is how one can choose between one system or the other, and between one processing model and the other. You have seen above that the one is related to the other; choosing MPI for performing your calculations means you don’t use Hadoop or the grid, and conversely, choosing the grid or Hadoop as a system means you don’t use MPI (not completely true when YARN gets here, but you get the point).
The question is: to what extent can a problem be expressed to fit a certain parallel system, and to what extent can a parallel system be built to fit a specific problem? Fellow Hadoop enthusiast Jimmy Lin states that “if all you have is a hammer, throw away everything that is not a nail”. He argues that if a parallel programming model is “good enough” we will benefit if we adapt our problem. At SURFsara we share Jimmy’s opinion – we try to support as many types of computational applications as possible with a limited set of different tools. For us this is an obvious necessity driven by limited funds and the need for maintainability. We do have more than a single hammer though. We have 6 different tools, to be precise, and like any self-respecting service-oriented organization we’re continuously asking ourselves the question: is this enough? Could we better serve our users with a more diversified toolset? Or just the opposite – should we reduce the complexity of our service portfolio, focus our effort and master fewer tools, even if that means we cannot serve certain types of applications? Which users, which applications, which domains would we fail if we lacked a suitable processing model?
These are questions that can be answered from many different angles. In the end it’s a continuous search for the right balance. We make the choice to stay informed of the latest developments, to keep discussing, and we do our best to be agile enough to make a move one way, and to move another way if that seems a better decision at the next moment in time.