Data-driven organizations know where their focus should lie—data collection and analytics. However, they rarely know the best way to divvy up that focus. Usually, about 80 percent is spent gathering data, with a mere 20 percent spent analyzing it. That’s not ideal.
Too often, companies approach data collection this way because they lack the infrastructure to effectively ingest and process the vast streams of information at their disposal. Unfortunately, this can lead to incomplete data and prevent a business from making fully informed decisions.
Not long ago, a relational database—a conventional storage system designed for structured data such as column-and-row spreadsheets—was sufficient for most organizations. But new types of unstructured data don’t fit within the structured confines of this setup.
Say, for example, a business is analyzing call center data to identify their customers’ primary intentions for calling in. In the past, a relational database would suffice to track which boxes were checked about why the customer called: repair request, order spare parts, and so on. This approach may have worked well in the 1990s, but today, accessing the unstructured data in freeform Twitter feeds, text fields, Yelp reviews, or voice messages could potentially be more interesting.
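To make that contrast concrete, here is a minimal sketch. The field names, sample transcript, and keyword lists are purely hypothetical, not drawn from any real call center system: a relational row can only record the checkboxes someone defined in advance, while even a naive keyword pass over freeform text can surface intents no checkbox anticipated.

```python
# Structured record: the caller's intent must fit a pre-defined checkbox.
structured_call = {"call_id": 101, "repair_request": True, "order_spare_parts": False}

# Unstructured record: a freeform transcript with no schema at all.
freeform_call = ("My dishwasher is leaking again and the last repair didn't help. "
                 "Thinking of cancelling my plan.")

# Hypothetical keyword map for pulling intents out of freeform text.
INTENT_KEYWORDS = {
    "repair_request": ["repair", "broken", "leaking"],
    "cancellation_risk": ["cancel", "cancelling", "switch provider"],
}

def extract_intents(text):
    """Return every intent whose keywords appear in the text."""
    lowered = text.lower()
    return sorted(
        intent
        for intent, keywords in INTENT_KEYWORDS.items()
        if any(kw in lowered for kw in keywords)
    )

print(extract_intents(freeform_call))
# The freeform transcript reveals a churn signal ("cancellation_risk")
# that the checkbox schema could never capture.
```

Real systems would use far more sophisticated text analytics, but the point stands: the interesting signal lives in data the relational schema was never built to hold.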
Relational data management systems were never designed to handle unstructured data sets, nor the explosion in data variety, volume, and velocity. The result: data generation is far outpacing companies’ ability to ingest and analyze it. Unless companies can find a better solution, this data often simply gets ignored.
The downside of running a business on incomplete data can be substantial, particularly when your competitors leverage information sources that you have yet to explore. Legacy data solutions can make matters worse by creating data fragmentation, which prevents a business from gaining a comprehensive, single view of the customer. It also leads to higher turnaround times for certain tasks, as well as subpar decision-making.
One insurance provider, for example, had internal business units set up their own data storage islands. Each island held data specific to that unit’s function or products, and could only be accessed by employees within that group. This fragmentation slowed productivity and sales, particularly among the company’s independent insurance agent partners, who had to pull information from six different data islands to build a single view of the customer.
To solve the problem of fragmented or incomplete data, organizations are turning to Hadoop, an open source software platform for the distributed storage and processing of massive, diverse data sets on commodity hardware (inexpensive, off-the-shelf servers). Hadoop enables companies to realize the potential of big data and gather actionable insights from digital information, which can translate into a competitive advantage and better customer service.
While there are many facets—and a lot of confusing jargon—to Hadoop, its storage component can be explained simply. Hadoop allows you to store and process files that are larger than what can be saved on a single server or node, which lets organizations keep both extremely large files and a vast number of files. Unlike traditional relational database systems, Hadoop doesn’t require users to create structured schemas before storing data. It also allows users to save information in unstructured and semi-structured formats.
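This “store first, structure later” idea is often called schema-on-read, and it can be sketched without a cluster. The file formats and parsing logic below are illustrative only, not Hadoop APIs: raw records are kept exactly as they arrived, and structure is imposed only when an application reads them.

```python
import csv
import io
import json

# Raw events land in storage exactly as they arrived -- no upfront schema.
raw_lines = [
    '{"user": "ana", "action": "click", "ms": 12}',   # JSON event
    "ben,view,30",                                     # CSV event
    "carol visited the pricing page",                  # freeform log line
]

def read_with_schema(line):
    """Impose structure at read time: try JSON, then CSV, else keep raw text."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        pass
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) == 3:
        return {"user": fields[0], "action": fields[1], "ms": int(fields[2])}
    return {"raw": line}  # nothing is thrown away; oddballs stay queryable

records = [read_with_schema(line) for line in raw_lines]
```

In a relational system, the freeform third line would have been rejected (or never collected) because it fits no table; under schema-on-read, it is stored anyway and each application decides how much structure to extract.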
TrueCar, an automotive pricing and information site for car buyers and dealers, has hundreds of business partners and acquires data from many different sources. It looked to Hadoop to build an economical and scalable data architecture that could capture varied and vast amounts of information.
Before Hadoop, TrueCar relied on Microsoft SQL Server, a traditional relational database that powered many of the company’s applications. But the company came to realize that its ambitious data goals far exceeded the capabilities of a traditional SQL Server environment. The switch to Hadoop enabled TrueCar to move all its data into one “data lake,” so the company could write applications that draw from a single information source. Prior to Hadoop, TrueCar had about 230 different databases spread throughout the company and spent too much time building data pipelines to shuttle information between these data fiefdoms.
In short, Hadoop is a game changer. It’s more than a data warehouse: it fuses data storage with data processing, thus enabling powerful and highly scalable applications that can give businesses a competitive advantage.
To see where you stand in your big data journey, check out the Hortonworks Big Data Scorecard.