Thoughts on the ‘Big Data in Science’ Workshop

Last week I had the pleasure of attending a workshop at Imperial College London on "The Future of Big Data Management". It was organized by some of the CERN physicists, who were interested in bringing scientists from different fields together with those of us in the computing industry who are working on some of the same problems.

CERN's Large Hadron Collider experiments (ATLAS and CMS being the big two) are the latest in a long line of particle detectors that have always stressed the computing, network and storage technologies of their time. This workshop was set up not so much to look at the challenges facing those teams (which are significant!) as at the impact Big Data will have on other sciences, and what the issues will be.

As a result, there was a very interesting attendee list, from the people building the latest generation of the UK academic network, SuperJanet 6, to people just starting to think about the impact that large, invariably machine-generated, datasets will have on their diverse sciences.

The presentations are up online for anyone interested, including my own on HDFS, MapReduce and beyond (which is embedded below too).

[slideshare id=23613004&doc=2013-06-26-atlas-yarn-130628054411-phpapp02]
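For anyone unfamiliar with the model the talk covers, the MapReduce flow can be sketched in a few lines of plain Python: a mapper emits key/value pairs, the framework sorts them by key (the shuffle), and a reducer aggregates each group. This is an illustrative in-process simulation, not Hadoop code; the word-count task and input lines are invented for the example.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word seen.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reduce phase;
    # sorting here in-process simulates that shuffle step.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["big data in science", "big data at CERN"]
    counts = dict(reducer(mapper(lines)))
    print(counts)  # → {'at': 1, 'big': 2, 'cern': 1, 'data': 2, 'in': 1, 'science': 1}
```

The real framework distributes the map and reduce phases across a cluster and streams the shuffle over the network, but the programming model is exactly this small.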

It would take too long to cover each talk in the detail it deserves; instead, here are summaries of a few different areas.

By the end of the decade, astronomy is likely to be the science with the highest data rates. Why? The Square Kilometre Array, being built at remote sites in South Africa, Australia and New Zealand, will be receiving data from its radio telescopes at a rate of 4 Petabits/second, with some 300 Petabytes of data stored per year. The follow-on, SKA2, will generate ten times as much: 3 Exabytes a year. Those are numbers that even those of us who build "production" Hadoop clusters of tens of Petabytes are impressed by: handling the data ingress, the storage and the computation are all going to be massive undertakings.
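To put the archive figure in perspective, a back-of-envelope calculation (using the 300 Petabytes/year number above; decimal petabytes and a 365-day year are my assumptions) shows the sustained write rate that the storage alone implies:

```python
# Back-of-envelope sketch of the SKA archive numbers quoted above.
PB = 10**15                        # decimal petabyte, in bytes (assumption)
SECONDS_PER_YEAR = 365 * 24 * 3600

stored_per_year = 300 * PB         # archived data, bytes/year
sustained_write = stored_per_year / SECONDS_PER_YEAR

print(f"sustained archive write rate: {sustained_write / 10**9:.1f} GB/s")
# → sustained archive write rate: 9.5 GB/s
```

Roughly 10 GB/s, every second, for years: and that is after the raw 4 Petabits/second from the telescopes has already been reduced.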

Bioinformatics and genealogy are other areas of data growth, as are environmental and weather data, the latter combining climate modeling, satellite data and weather observations. As better instruments are deployed and simulations run at higher resolution on faster HPC clusters, the rate at which data accumulates can only increase. There's a lot of diverse data here, which means there are opportunities to combine it for new insights. The growth in the size of that data is changing how people combine it: instead of downloading it to their local machines to analyze, they now need to bring their code to the data. As a result, archival organizations such as the Centre for Environmental Data Archival are having to move beyond storing the data to actually allowing their users to upload VM images and other code into their datacenter.

One recurrent theme of the workshop was "how do we work with this data in 35-50 years?" The concern here isn't just preserving the bits, but being able to extract the information in them. Self-describing files, as in formats like Avro and ORC, along with HCatalog's ability to share those schemas, reflect some of our experience there: you never want files whose inner data formats are lost in the past. What matters in science is the ability of other scientists to take that data and reproduce the analysis, either with the same tools or new ones, possibly decades later.
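The self-describing idea is simple enough to sketch. The toy example below (plain JSON standing in for Avro's binary container format; the "Observation" record and its fields are invented for illustration) shows a file carrying its own schema in the header, so a reader with no prior knowledge of the format can still recover the field names:

```python
import io
import json

# Sketch of a self-describing file, in the spirit of Avro: the schema
# travels in the file header, so a reader decades later can interpret
# the records without any external documentation.
schema = {
    "type": "record",
    "name": "Observation",           # hypothetical record type
    "fields": [
        {"name": "telescope", "type": "string"},
        {"name": "flux_mjy", "type": "double"},
    ],
}
records = [{"telescope": "dish-042", "flux_mjy": 12.5}]  # invented data

buf = io.StringIO()                  # stands in for a file on disk
json.dump({"schema": schema, "records": records}, buf)

# A reader that knows nothing in advance recovers the structure
# from the embedded schema alone.
buf.seek(0)
payload = json.load(buf)
field_names = [f["name"] for f in payload["schema"]["fields"]]
print(field_names)  # → ['telescope', 'flux_mjy']
```

Avro does this with a compact binary encoding and schema-resolution rules for reading old data with new schemas, but the principle is the same: the data describes itself.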

It may seem over-cautious to worry about data retrieval problems fifty years hence, but the Y2K bug showed how code can outlive its expected life; data must be designed to outlive the applications. It also helps to have applications and tools that aren't handwritten by individuals expecting to be the only people who will ever use and maintain them. Higher-level tooling (in the Hadoop space, that's Hive, Pig and Cascading) offloads the maintenance effort and helps others to work with your data.

From a Hadoop perspective, much of the stack we are working on could be of benefit, but it's going to take support from the Hadoop community to help: not just in education and support, but in actually collaborating on projects, maybe even sharing your own datasets.

To close on that point, I’d like to repeat what a speaker on big data in the humanities said:

"One hundred years from now, a historian will want to use supermarket loyalty card data for their thesis on the recession."


