Building a modern data architecture around Hadoop, one that delivers high-scale, low-cost data processing, means integrating Hadoop effectively inside the data center. For this post, we asked Yves de Montcheuil, VP of Marketing at Talend, about his customers’ experiences with Hadoop integration. Here’s what he had to say:
Most organizations are still in the early stages of big data adoption, and few have looked beyond the technology to consider how profoundly big data will affect their processes and their information architecture. Whether big data projects are past the pilot stage and being deployed in production, or still on the horizon, they require strategic thinking and adequate planning to avoid some now-typical pitfalls that tend to get in the way of success for Hadoop projects.
Here are five key areas to watch:
- Forget volume (or rather, don’t focus on it). Big data is large – and small. It’s extremely diverse in origin, in style, in consistency and in quality. Some organizations in certain industries are dealing with massive data volumes, while others have much smaller data sets to exploit, but might have a broader variety of sources and formats. Make sure you go after the “right” data: identify all the sources that are relevant, and don’t be embarrassed if you don’t need to scale your data computing cluster to hundreds of nodes right away!
- Don’t leave data behind – be comprehensive. Some of the data you need for your big data projects is clearly identified, such as transactional data used or generated by business applications. However, more of this data is hidden in log files, manufacturing systems, desktops or various servers; this is what we call “Dark Data”. Some of it is even going to waste in the exhaust fumes of IT. This “Exhaust Data” from sensors and logs is purged after a certain amount of time, or never stored in the first place. All of it is potentially relevant. Don’t restrict your project to the first category: inventory Dark Data, and deploy collection mechanisms for Exhaust Data, so that they become value contributors as well.
- Don’t move everything – distribute data “logically.” Too many organizations looking for ways to break down data silos bring all the data together in one central place, and Hadoop is an excellent storage resource for large amounts of data (and it is in itself distributed across clusters). However, you need to think “distribution” beyond Hadoop. It’s not always necessary to duplicate and replicate everything. Some data is already readily available in the enterprise data warehouse, with fast, random access. Some of it might be better off residing where it was produced. The “Logical Data Warehouse” concept applies well in the “non big data” world. Leverage it for big data.
- It’s not only about storage – think processing platform. Hadoop is not only a receptacle for big data with its distributed file system; it is also an engine with enormous potential to process data and extract meaningful information. A broad ecosystem of tools and programming paradigms exists, covering all use cases of data manipulation. From MapReduce to YARN, from Pig to HiveQL/Stinger, there are processing resources available that make it unnecessary to get data out of the platform. All the resources are here, at your fingertips.
- Lastly, don’t treat big data as an isolated island. Sandboxes are fine for proofs of concept, but when big data projects go live, they need to be an integral part of the overall IT infrastructure and information architecture. You need to connect big data applications to other systems, upstream and downstream. Big data must also become part of your IT and information governance policy.
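To make the “processing platform” point above a little more concrete, here is a minimal sketch of the MapReduce programming model in plain Python. It is not Hadoop code and does not use any Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample log lines are hypothetical, chosen only to illustrate the map, group-by-key, and reduce steps that a MapReduce job distributes across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical log lines standing in for "Exhaust Data".
    sample = ["error disk full", "warning disk slow", "error network down"]
    print(word_count(sample))
```

On a real cluster the map and reduce steps run in parallel on the nodes that hold the data, which is why the data does not have to leave the platform to be processed. This sketch runs everything in a single process and only shows the shape of the computation.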
While interest in big data strategies and their roll-out have increased significantly, many organizations are still stuck in the starting blocks. Because the platforms and their applications are so new, big data projects are typically in the spotlight, and expectations are extremely high.
Avoiding these common pitfalls will help organizations learn from the experience of others and steer clear of obstacles on their big data journey.