Defining big data by the questions it allows to be asked

Enthusiasm for Hadoop-based big data analysis has led some people to see the technology as simply a faster way to solve existing problems, but others argue that this view sells the true promise of big data short. In a recent article for Quartz, Gartner's Chris Guan explained that the combination of Hadoop file systems and cloud computing resources is enabling scientists to ask more ambitious questions than ever before and draw on data sets far larger than the ones that powered the groundbreaking achievements of the past.

"It is not that small data problems are no longer important, it is that even solutions lead to new questions, and science doesn't want to be bound by computational difficulty in the pursuit of answers," Guan wrote.

He outlined two laws in computer science, Amdahl's and Gustafson's. Amdahl's Law describes how much faster a fixed-size problem can be solved as more processing power is applied, while Gustafson's Law describes how much larger a problem can be solved in a fixed amount of time as computing capability grows. While many IT professionals tend to focus on big data advancements from the perspective of Amdahl's Law, asking how analytics functions can be done more efficiently, the more exciting application in science might be one that inverts this model and explores which questions big data enables researchers to ask. As data size becomes less of a limitation, many new avenues of research may open up.

Defining "big data"
In a post on the O'Reilly Radar blog, analyst Mike Loukides reiterated the importance of thinking about how the term "big data" is defined. Data sets that may seem big now will soon seem inconsequential – after all, a "big" data set in the 1960s was only a few megabytes. Loukides highlighted Roger Magoulas' definition of big data – when the size of the data becomes part of the analysis challenge – and noted its scalability.

"Data, and specifically 'big data,' will always be at the edges of research and understanding," he wrote. "Whether we're mapping the brain or figuring out how the universe works, the biggest problems will almost always be the ones for which the size of the data is part of the problem. That's an invariant. That's why I'm excited about data."

As processing capabilities grow with Hadoop, researchers will not only have more tools to answer the questions they are currently asking, they will also be empowered to think of new and bigger questions that might have seemed impossible to digest in the past.

