Big Data Defined
‘Big Data’ has become a hot buzzword, but a poorly defined one. Here we will define it.
Wikipedia defines Big Data in terms of the problems posed by the awkwardness of legacy tools in supporting massive datasets:
"In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications."
It is better to define ‘Big Data’ in terms of opportunity, in terms of transformative economics. Big Data is the opportunity space created by new open source, distributed systems from the consumer internet space.
Specifically, a Big Data system has four properties:
- It uses local storage to be fast but inexpensive
- It uses clusters of commodity hardware to be inexpensive
- It uses free software to be inexpensive
- It is open source to avoid expensive vendor lock-in
Cheap storage makes it easy to log enormous volumes of data to many disks; processing that data is harder. Distributed systems that have the four properties above are disruptive because they are approximately 100 times cheaper than traditional systems for processing large volumes of data, and because they deliver far more I/O bang for the buck.
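The cost differential is simple arithmetic. As a rough sketch (the per-terabyte prices below are hypothetical assumptions for illustration, not figures from any vendor):

```python
# Illustrative cost comparison. Both prices are hypothetical
# assumptions chosen only to show the shape of the arithmetic.
SAN_COST_PER_TB = 5000    # assumed enterprise SAN price, $/TB
LOCAL_COST_PER_TB = 50    # assumed commodity local disk price, $/TB

def cost_ratio(expensive: float, cheap: float) -> float:
    """Return how many times cheaper the second option is."""
    return expensive / cheap

ratio = cost_ratio(SAN_COST_PER_TB, LOCAL_COST_PER_TB)
print(f"Commodity local storage is ~{ratio:.0f}x cheaper per TB")
```

At a ratio anywhere near this, storing everything stops being a budgeting question and becomes the default.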
Apache Hadoop is one such system. Hadoop ties together a cluster of commodity machines with local storage using free and open source software to store and process vast amounts of data at a fraction of the cost of any other system.
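The processing model Hadoop popularized is MapReduce: a map step emits key-value pairs on each machine, the framework shuffles them by key, and a reduce step aggregates each group. The single-process word count below is a minimal sketch of that model; the function names and the in-memory "shuffle" are illustrative, not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for each word in a line of input."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum all counts emitted for a single word."""
    return key, sum(values)

lines = ["big data is big", "data wants to be free"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"], counts["data"])  # 2 2
```

On a real cluster, the map and reduce steps run in parallel across machines, each reading from its own local disks; that locality is where the speed and the savings come from.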
[Figure: cost comparison of SAN storage, NAS filers, and local storage]
It is out of this cost differential that our opportunity arises: to log every shred of data we can in the cheapest place possible. To provide access to this data across the organization. To mine our data for value. To undergo the transformative processes that unabridged access to data provides, enabling bigger, better, faster, more profound insight than ever before.
This is a working definition of Big Data.
What do you think? What is your definition of Big Data?