Apache Hadoop and Data Agility
In a recent blog post I mentioned the 4 reasons for using Hadoop for data science. In this blog post I would like to dive deeper into the last of these reasons: data agility.
In most existing data architectures, based on relational database systems, the data schema is of central importance, and needs to be designed and maintained carefully over the lifetime of the project. Furthermore, whatever data fits into the schema will be stored, and everything else typically gets ignored and lost. Changing the schema is a significant undertaking, one that most IT organizations don’t take lightly. In fact, it is not uncommon for a schema change in an operational RDBMS system to take 6-12 months if not more.
Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.
With Hadoop, storing a new type of data is as simple as creating a new folder and pushing the new data files into that folder. It doesn’t require an IT project to redesign the schema and upgrade production systems with that new schema.
Teams developing data products using Hadoop benefit from much shorter development cycles, and are able to test 5-10x more hypotheses in a given time-frame. Very quickly, people notice the shorter innovation cycles, and more teams start using Hadoop to gain the same benefit.
By the way, although Hadoop is mostly used to store and process really big datasets (aka big data), this benefit of data agility is true for any dataset stored on Hadoop, big or small.