An Enterprise Data Warehouse (EDW) is an organization's central data repository, built to support business decisions. The EDW contains data related to the areas the company wants to analyze; for a manufacturer, those might be customers, products, or bills of material. An EDW is built by extracting data from a number of operational systems. As data is fed into the EDW, it is converted, reformatted, and summarized to present a single corporate view. Data is added to the warehouse over time in the form of snapshots, and an EDW normally contains data spanning 5 to 10 years.
EDW is Expensive
Built on commercial and proprietary technology that is expensive to acquire (licensing cost)
Runs on expensive converged appliances
Cost continues to rise as new users and data are added to the EDW
Operationally expensive – takes 18 to 24 months to find data sources, agree on business questions and model the data to answer them
EDW is Rigid
Data model must be in place before a single business question can be answered using the data in the EDW (schema-on-write)
Designed to answer pre-determined questions.
Data modeling is a lengthy and labor-intensive process
Any change in the organization's business model requires a change in the EDW's data model
EDW is Inefficient
50-70% of data in the EDW is unused or cold
45-65% of CPU capacity is used for ETL/ELT
25-35% of CPU consumed by ETL is to load unused data
30-40% of CPU is consumed by only 5% of ETL workloads
HDP (Hortonworks Data Platform) is 100% open - there is no licensing fee for software
HDP runs on commodity hardware
New data can be landed in HDP and used in days or even hours
Data can be loaded in HDP without having a data model in place
Data model can be applied based on the questions being asked of the data (schema-on-read)
HDP is designed to answer questions as they occur to the user
100% of the data is available at granular level for analysis
HDP can store and analyze both structured and unstructured data
Data can be analyzed in different ways to support diverse use cases
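The schema-on-read idea above can be sketched in a few lines. This is a minimal illustration, not Hadoop code: the sample records, field names, and `read_with_schema` helper are all hypothetical, standing in for raw files landed in HDFS. The point is that the same raw data can be loaded with no model in place, and different schemas applied only at query time.

```python
import json

# Raw records landed "as-is" -- no schema was required at load time.
# (Hypothetical sample data; in HDP these would be files in HDFS.)
raw_lines = [
    '{"customer": "ACME", "amount": "125.50", "region": "EMEA"}',
    '{"customer": "Globex", "amount": "80.00", "extra_field": "ignored"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: project and cast only the
    fields the current question needs (schema-on-read)."""
    for line in lines:
        record = json.loads(line)
        yield {field: cast(record[field]) for field, cast in schema.items()}

# A different question tomorrow could apply a different schema
# to the exact same raw data -- nothing was lost at load time.
sales_schema = {"customer": str, "amount": float}
rows = list(read_with_schema(raw_lines, sales_schema))
print(rows)
```

In an actual HDP cluster this role is played by tools such as Hive, where an external table definition imposes a schema over raw HDFS files at query time.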
By design, Hadoop runs on low-cost commodity servers and direct-attached storage, which allows for a dramatically lower overall cost. Compared to high-end storage area networks, scale-out commodity compute and storage with Hadoop provides a compelling alternative, one that allows users to scale out their hardware only as their data grows. This cost dynamic makes it possible to store, process, access, and analyze more data than ever before.
The ETL function is a relatively low-value computing workload that can be performed at a low cost in Hadoop. When onboarded to Hadoop, data is extracted, transformed and then the results are loaded into the data warehouse. The result: critical CPU cycles and storage space are freed for the truly high value functions – analytics and operations – that best leverage advanced capabilities in the data architecture.
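The offload pattern described above can be sketched as follows. This is a toy illustration under stated assumptions (the CSV data, region names, and `transform` function are invented): the heavy extract-and-transform work runs on the Hadoop side over the full detail data, and only the small aggregated result is loaded into the EDW, freeing warehouse CPU and storage for analytics.

```python
import csv
import io

# Hypothetical raw detail records; in practice these would sit in HDFS.
raw_csv = """order_id,region,amount
1,EMEA,100.0
2,EMEA,50.0
3,APAC,75.0
"""

def transform(rows):
    """Aggregate detail rows to per-region totals. This is the
    low-value, CPU-heavy step being offloaded from the EDW."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + float(row["amount"])
    return totals

detail = csv.DictReader(io.StringIO(raw_csv))
summary = transform(detail)
# Only this small summary -- not the raw detail -- is loaded into the EDW.
print(summary)
```

At cluster scale the same shape of job would typically be expressed in Hive or Spark rather than plain Python, but the division of labor is identical: transform in Hadoop, load only results into the warehouse.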
An incredible array of new data types opens possibilities for analysis within the high-performance EDW environment. The varied structures of these new data types, however, present challenges for EDWs not designed to ingest and analyze those formats. Many organizations rely on the flexibility of Hadoop to capture, store and refine these new data types to use within the EDW. They take advantage of the ability to define schema upon read in Hadoop, gathering and storing data in any format and creating schema to support analysis in the EDW when necessary.