Apache Hadoop and Data Agility

In a recent blog post I mentioned the four reasons for using Hadoop for data science. In this post I would like to dive deeper into the last of those reasons: data agility.

In most existing data architectures, built on relational database systems, the data schema is of central importance and must be carefully designed and maintained over the lifetime of the project. Furthermore, only data that fits the schema gets stored; everything else is typically ignored and lost. Changing the schema is a significant undertaking, one that most IT organizations don’t take lightly. In fact, it is not uncommon for a schema change in an operational RDBMS to take six to twelve months, if not more.

Hadoop is different. No schema is needed when you write data; instead, a schema is applied later, when an application reads the data, hence the term “schema on read”.
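
To make the idea concrete, here is a minimal sketch of schema on read in plain Python (the file contents, field names, and parsing are invented for illustration): the data sits in storage as raw text, and the consuming application imposes whatever schema it needs at the moment it reads.

    import csv
    import io

    # Raw events were written to storage as-is; no schema was enforced at write time.
    # (io.StringIO stands in for a file pulled from HDFS.)
    raw_lines = io.StringIO(
        "2015-03-01,user42,click,/home\n"
        "2015-03-01,user99,purchase,/checkout\n"
    )

    # The schema is applied only now, at read time, by this application.
    # Another application could read the same bytes with a different schema.
    schema = ["date", "user_id", "event_type", "page"]

    for row in csv.reader(raw_lines):
        record = dict(zip(schema, row))
        print(record["user_id"], record["event_type"])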

With Hadoop, storing a new type of data is as simple as creating a new folder and pushing the new data files into that folder. It doesn’t require an IT project to redesign the schema and upgrade production systems with that new schema.
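
As a sketch of what that looks like in practice, assuming the standard hdfs dfs shell commands are available on the client machine and using hypothetical paths, landing a brand-new dataset is just two commands:

    import subprocess

    # Hypothetical paths; adjust to your cluster's layout.
    hdfs_dir = "/data/clickstream/2015-03-01"
    local_file = "events-2015-03-01.log"

    # Create the folder and push the new data files into it, using the
    # standard Hadoop shell commands 'hdfs dfs -mkdir -p' and 'hdfs dfs -put'.
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])
    subprocess.check_call(["hdfs", "dfs", "-put", local_file, hdfs_dir])

No schema migration, no change request: the data is available to any schema-on-read tool as soon as the files land.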

Teams developing data products on Hadoop benefit from much shorter development cycles and can test five to ten times more hypotheses in a given time frame. Very quickly, others notice the shorter innovation cycles, and more teams adopt Hadoop to gain the same benefit.

By the way, although Hadoop is mostly used to store and process really big datasets (aka big data), this benefit of data agility is true for any dataset stored on Hadoop, big or small.
