Guest Author: Nagapriya Tiruthani, Offering Manager, IBM Db2 Big SQL
The rules of the game for data analytics have changed. Data is no longer available only in relational databases. With data flowing in from sources like IoT sensors, clickstreams, and mobile devices, it arrives faster and in richer forms than ever. One caveat: data from different sources is not always in the same format. With data of such high volume, variety, and velocity, we have a chance to take analytics up a notch — deeper analysis to forecast and predict new opportunities, find new customer segments, and stay competitive.
But not all of that data can be stored in one repository. Some data belongs in relational databases, some in NoSQL databases, and some in Apache Hadoop. With data this disparate, a data engineer or data scientist spends most of their valuable time trying to bring it together in one place so it can be queried, rather than getting insights right away.
Why spend time consolidating data rather than getting insights from it right away? How can you bring data together when it is stored in different repositories? And how do you ensure that access to it is secure?
This is a perfect scenario for Db2 Big SQL. Db2 Big SQL is a SQL-on-Hadoop engine that can query data not only on Hadoop but also in RDBMSs, NoSQL databases, object stores, and any source that has a Spark connector. Federation gives you a single connection for all of these data sources, with no need to move or replicate data. Once connected to Db2 Big SQL, you can use ANSI SQL to query, transform, and join data wherever it is stored.
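As a quick sketch of what that looks like (all table and nickname names here are hypothetical), a single statement can join a Hive table on Hadoop with a nickname that points at a remote relational table:

```sql
-- Hypothetical example: SALES is a Hive table managed by Db2 Big SQL,
-- and ORA_CUSTOMERS is a nickname pointing at a table in a remote
-- Oracle database. The join reads both sources in one statement.
SELECT c.customer_name,
       SUM(s.amount) AS total_spend
FROM   sales AS s
JOIN   ora_customers AS c
       ON s.customer_id = c.customer_id
WHERE  s.sale_date >= DATE '2018-01-01'
GROUP  BY c.customer_name;
```

From the user's point of view this is plain ANSI SQL; the federation layer decides which parts run on Hadoop and which are shipped to the remote source.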
Having such a rich set of data, drawn from many data sources, enables much deeper and more complete analytics.
Db2 Big SQL is installed on the Hortonworks Data Platform (HDP) and can query Apache Hive and Apache HBase tables in the same query. Using federation, you can now combine data on Hadoop with traditional data sources such as Oracle, Db2, Teradata, Netezza, Informix, and Microsoft SQL Server. In v5.0, we introduced Apache Spark integration, which enables querying any data source that has a Spark connector. This not only opens up a whole new range of data sources to query but also enables working with machine learning (ML) models stored in Hadoop. With this capability, you can operationalize an ML model with SQL and feed it input data even when that data is not stored in a single location. In other words, with federation you can query across data sources and use the combined result as input to an ML model.
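Setting up federation to a source like Oracle typically follows Db2's wrapper/server/nickname pattern. The sketch below is illustrative only; the wrapper, server, node, and schema names are assumptions for this example:

```sql
-- Register the wrapper (the library that talks to Oracle).
CREATE WRAPPER NET8;

-- Define the remote Oracle server; the node name would come from
-- the Oracle client configuration (e.g. tnsnames.ora).
CREATE SERVER ora_server
       TYPE ORACLE VERSION '12'
       WRAPPER NET8
       OPTIONS (NODE 'ora_tns');

-- Create a local nickname for a remote table; queries against the
-- nickname are transparently shipped to the remote Oracle database.
CREATE NICKNAME ora_customers
       FOR ora_server."SCOTT"."CUSTOMERS";
```

Once the nickname exists, it behaves like a local table in any Db2 Big SQL query.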
Here’s a figure that shows how federation works with Db2 Big SQL as a single-entry point for disparate data.
The questions and answers below highlight what Db2 Big SQL brings when you federate disparate data.
How does Db2 Big SQL do this?
Db2 Big SQL brings decades of Db2 SQL engine research to Hadoop, including one of the best cost-based optimizers and query compilers in the big data world. In practice this means that when a query is submitted to Db2 Big SQL, the engine rewrites it so that the execution plan is optimized based on data location, table and column statistics, and so on. When the query has predicates, those predicates are pushed down closer to the data, so that only the necessary rows are retrieved rather than fetching entire tables back to the central execution engine.
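If you want to see what the optimizer decided, you can capture and inspect the access plan. A hedged sketch, assuming the Db2 explain tables already exist and the database is named `bigsql` (both assumptions for this example):

```sql
-- Capture the access plan for a federated query.
EXPLAIN PLAN FOR
SELECT s.amount, c.region
FROM   sales s
JOIN   ora_customers c ON s.customer_id = c.customer_id
WHERE  c.region = 'EMEA';

-- Then format the captured plan from the command line, e.g.:
--   db2exfmt -d bigsql -1 -o plan.txt
-- SHIP operators in the formatted plan indicate work that was
-- pushed down to the remote data source.
```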
How does Db2 Big SQL handle querying data sources that have proprietary SQL dialects?
When a query combines data on Hadoop with data in, say, an Oracle database, Db2 Big SQL recognizes the remote data source and rewrites the relevant parts of the query into a form that the remote source can execute.
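As an illustration of the kind of rewrite involved (the SQL actually shipped to the remote source is generated internally and may differ), consider a row-limiting clause, which older Oracle versions express differently than Db2:

```sql
-- What you write against Db2 Big SQL (ora_customers is a hypothetical
-- nickname for a remote Oracle table):
SELECT customer_name
FROM   ora_customers
FETCH FIRST 10 ROWS ONLY;

-- A sketch of what the federation layer might ship to Oracle instead,
-- in Oracle's dialect:
--   SELECT customer_name FROM "SCOTT"."CUSTOMERS" WHERE ROWNUM <= 10
```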
Are all predicates pushed down?
Db2 Big SQL does effective query pushdown, but that does not mean every predicate is pushed down. Where and how predicates are pushed down is decided from estimates of the number of rows processed or returned.
For example, suppose a query joins two tables at a single remote source, T1 (10,000 rows) and T2 (25 rows), in a way that yields a Cartesian product of 250,000 rows. Rather than pushing the join down and pulling back all 250,000 result rows, the optimizer can choose to fetch the two base tables (about 10,025 rows in total) and perform the join locally.
This drastically reduces the amount of data transferred between the remote data source and Db2 Big SQL. Calculations of this kind are what drive effective predicate pushdown.
How can data access be secure?
Federation is a Db2 feature that has long been proven in enterprise environments, and Db2 Big SQL brings that enterprise capability to Hadoop. When it comes to secure data access, there are several ways to ensure that only the right users can reach the remote data while keeping maintenance low. The most common approach is to create user mappings, or to use a trusted context that defines a trust relationship between the client and the remote data source.
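Both approaches can be expressed in SQL. The sketch below is illustrative; the user names, server name, credentials, and address are assumptions for this example:

```sql
-- Map a local Db2 Big SQL user to credentials on the remote server
-- (ora_server is a previously defined federated server).
CREATE USER MAPPING FOR bigsql_user
       SERVER ora_server
       OPTIONS (REMOTE_AUTHID 'scott', REMOTE_PASSWORD 'tiger');

-- Alternatively, a trusted context restricts which client connections
-- are trusted, e.g. by system authorization ID and client address.
CREATE TRUSTED CONTEXT etl_trust
       BASED UPON CONNECTION USING SYSTEM AUTHID etl_svc
       ATTRIBUTES (ADDRESS '9.2.3.4')
       ENABLE;
```

User mappings keep remote credentials out of client applications, while a trusted context narrows trust to specific, known connections.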
With all of these capabilities, Db2 Big SQL makes querying disparate data straightforward and immediate; there is no need to move, replicate, and refresh data before you can gain insights. To summarize, federation gives you a single connection to many data sources, lets you query data in place with ANSI SQL, and keeps access to remote data secure.
So, stop worrying about copying data to Hadoop and start working on the data at its source.
Watch our webinar called Making Enterprise Big Data Small with ease here.
Get some good information on federation in Db2 Big SQL documentation here.