Big Data Defined

‘Big Data’ has become a hot buzzword, but a poorly defined one. Here we will define it.

Wikipedia defines Big Data in terms of the problems posed by the awkwardness of legacy tools in supporting massive datasets:

    "In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications."

It is better to define ‘Big Data’ in terms of opportunity, in terms of transformative economics. Big Data is the opportunity space created by new open source, distributed systems from the consumer internet space.

Specifically, a Big Data system has four properties:

  • It uses local storage to be fast but inexpensive
  • It uses clusters of commodity hardware to be inexpensive
  • It uses free software to be inexpensive
  • It is open source to avoid expensive vendor lock-in

Cheap storage means logging enormous volumes of data to many disks is easy. Processing this data is less so. Distributed systems with the above four properties are disruptive because they are approximately 100 times cheaper than other systems for processing large volumes of data, and because they deliver high I/O performance per dollar.

Apache Hadoop is one such system. Hadoop ties together a cluster of commodity machines with local storage using free and open source software to store and process vast amounts of data at a fraction of the cost of any other system.

Storage type     Cost per GB
SAN Storage      $2-10/GB
NAS Filers       $1-5/GB
Local Storage    $0.05/GB
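A quick back-of-the-envelope calculation makes the gap concrete. The sketch below uses the per-GB prices from the table above (the midpoints and the 100 TB dataset size are illustrative assumptions, not figures from vendor pricing):

```python
# Back-of-the-envelope storage cost comparison using the per-GB
# prices from the table above. Midpoints are illustrative assumptions.
PRICE_PER_GB = {
    "SAN Storage": 6.00,    # midpoint of the $2-10/GB range
    "NAS Filers": 3.00,     # midpoint of the $1-5/GB range
    "Local Storage": 0.05,
}

DATASET_GB = 100_000  # 100 TB, a plausible log archive size (assumed)

for tier, price in PRICE_PER_GB.items():
    cost = price * DATASET_GB
    ratio = price / PRICE_PER_GB["Local Storage"]
    print(f"{tier:<14} ${cost:>10,.0f}  ({ratio:.0f}x local storage)")
```

At these assumed prices, storing 100 TB costs on the order of $600,000 on a SAN versus about $5,000 on local disks, a ratio in the neighborhood of the "approximately 100 times cheaper" claim above.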

It is out of this cost differential that our opportunity arises: to log every shred of data we can in the cheapest place possible, to provide access to this data across the organization, to mine our data for value, and to undergo the transformation that unabridged access to data makes possible, enabling bigger, better, and faster insight than ever before.

This is a working definition of Big Data.

What do you think? What is your definition of Big Data?
