Hadoop tutorial: Collecting data for an analytics project

When properly deployed, Apache Hadoop projects can significantly benefit enterprise operations. However, there are hurdles that need to be cleared before businesses can fully appreciate the benefits provided by this burgeoning technology. For one, the mere idea of launching a data analytics program without properly understanding the components involved can be daunting. As its name implies, big data analytics requires massive volumes of data to be compiled and processed in order to yield the deepest and most actionable insights. Some business leaders may be unsure whether they have the infrastructure in place to accommodate that level of storage. However, the proliferation of Hadoop has provided numerous companies with the resources needed to house their data analytics projects.

Data analytics expert and Data Informed contributor David Loshin recently explained that the Hadoop platform allows system administrators to increase storage capacity by adding nodes to clusters within a connected network. Another limitation businesses might face, beyond the sheer volume of data, is the monumental size of individual files. The Hadoop Distributed File System (HDFS) addresses this concern by splitting large files into blocks and distributing those blocks across its network of nodes. In addition, each block is replicated on multiple nodes, providing a level of protection in the event that one should fail. This combination of scalability and redundancy provides a reliable storage foundation for analytics projects.
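As a concrete illustration, HDFS's distribution and replication behavior is governed by a handful of properties in `hdfs-site.xml`. The values below are a minimal sketch, not a recommendation for any particular cluster: `dfs.replication` sets how many copies of each block are kept (the Hadoop default is 3), and `dfs.blocksize` sets the size of the chunks a large file is split into before being spread across nodes.

```xml
<!-- hdfs-site.xml: illustrative values only -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- keep three copies of every block -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- split large files into 128 MB blocks -->
  </property>
</configuration>
```

The replication factor of files that are already stored can also be adjusted later with the `hdfs dfs -setrep` shell command, so these settings are a starting point rather than a permanent commitment.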

Is there such a thing as too much data?
With the massive quantities of information being collected by Hadoop-based programs, business leaders may begin to question how much data is too much. They may be tempted to jettison any information that they have deemed trivial or inconsequential. However, big data projects have shown time and again that no information is inherently trivial. Larger data pools allow analytics processes to drill deeper and extract more accurate insights. Throwing away information for any reason could inhibit the effectiveness of an analytics campaign. Data security expert Joe Gottlieb recently spoke to InformationWeek about this conflict, noting that changing circumstances can make previously unusable information much more pertinent.

"You don't know what question you're going to answer tomorrow, and when you ask it, you'll be relieved that you kept the data," Gottlieb told the news outlet. He continued, "Err on the side of keeping more stuff, but keep an eye on it. Keep as much as you can afford to, and push yourself to use it. If you're not using it, you're going to start to feel badly about spending the money to store it."

Gathering data is the first step to deploying an effective analytics project. By leveraging Hadoop's network of nodes and clusters, business leaders can be sure that they have the resources available to accommodate this need.

