The Financial Services industry is undergoing a major transformation. Innovation in data technologies is driving the growth of predictive analytics and data mining techniques that will dramatically change banking over the next few years. This is the first of three blog posts describing that transformation. In this one, I’ll cover the importance of data science in banking and introduce key technological advances in the Apache™ Hadoop® ecosystem.
Connected Data Platforms for Actionable Data Science
The first phase of Apache Hadoop enabled batch processing for large, distributed data sets. Many of the founding members of Hortonworks came from the Yahoo! team that architected the first version of Apache Hadoop. That first version met Yahoo’s requirements for its first use case: indexing the World Wide Web. But as the platform gained popularity, those initial architects saw that it could meet a broader range of use cases across many industries. Hortonworks was founded with the proposition of extending Apache Hadoop for the enterprise, so that every major industry could take advantage of its unique strengths with Big Data.
Early Hortonworks customers in the financial services industry adopted Hortonworks Data Platform to gain better customer insights, reduce financial risk, and maintain regulatory compliance and reporting in the face of a growing tsunami of cybercrime and fraud.
But Hadoop 1.x had limitations that slowed its adoption among data scientists in banks. Those same gaps also impeded its adoption by other industries. With Hortonworks leadership, the open-source community rallied to extend and harden Apache Hadoop so that it could become the de facto platform to store and process enterprise data at scale.
With Hortonworks leadership in the community, Apache Hadoop 2.x enhanced the platform’s original efficiency and economics for storing and processing data in batch and extended it in important ways, most notably by opening the platform to interactive query, real-time streaming, and predictive analytics workloads beyond batch.
That last point around predictive analytics is especially important for data scientists at leading financial services companies. They process and extract meaningful insights from large volumes of structured and unstructured data. Companies then operationalize those insights to provide commercial value to their customers, employees and shareholders. Hadoop and its ecosystem of projects now form the backbone of innovative, enterprise-grade data management projects.
Many of the world’s largest banks and capital markets firms have turned to Hortonworks Data Platform (HDP) as their open-source choice for storing and processing data at rest. Over the years, many of those customers (and others outside of financial services) told Hortonworks that they could derive more value from their data science workloads if they had better tools for data in motion. They told us that their investments in HDP would become more valuable if they had an easier way to move data from across their data centers into HDP, or to enrich existing datasets through simpler, more secure ingest of external data sources, including sources that may arise in the future.
In response to that demand, Hortonworks launched Hortonworks DataFlow (HDF), an open-source platform based on Apache NiFi. HDF collects, conducts and curates data in motion, moving it from any source to any destination (such as HDP).
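Conceptually, the collect-conduct-curate pattern described above can be sketched as a short chain of pipeline stages. The sketch below is a hypothetical Python illustration of that flow (the stage functions and sample records are invented for this example); real HDF flows are built from Apache NiFi processors, not Python code:

```python
# Minimal, hypothetical sketch of a collect -> conduct -> curate pipeline
# for data in motion. Real HDF flows are composed of NiFi processors;
# this only illustrates the pattern.

def collect(source):
    """Collect raw events from any source (here, an in-memory list)."""
    yield from source

def conduct(events):
    """Conduct: route and filter events on their way to the destination."""
    for event in events:
        if event.get("amount") is not None:  # drop malformed records
            yield event

def curate(events):
    """Curate: enrich each event with provenance before delivery."""
    for event in events:
        event["ingested_by"] = "pipeline-sketch"  # hypothetical tag
        yield event

def deliver(events, sink):
    """Deliver curated events to a destination such as a data platform."""
    sink.extend(events)
    return sink

raw = [{"amount": 120.0}, {"amount": None}, {"amount": 75.5}]
destination = deliver(curate(conduct(collect(raw))), [])
print(len(destination))  # prints 2: the malformed record was filtered out
```

The value of the pattern is that each stage is independent, so sources and destinations can be swapped without rewriting the rest of the flow.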
Working together, HDP and HDF form Connected Data Platforms for both data at rest and in motion. Modern Data Applications deliver actionable intelligence from the Connected Data Platforms to data scientists at banks. For example, Hortonworks customers are speeding their ability to detect and stop fraudulent activity. They are more confidently predicting future market movements. They are launching innovative products and services that monetize their ability to understand changes in customer preferences—based on the aggregate money movement trends within their institutions.
Data Science Only Works with Both Data at Rest and Data in Motion
But all of this innovation is only possible with data. Banks and capital markets companies run on data—data on deposits, payments, balances, investments, interactions and third-party data that quantifies their risk of theft or fraud.
Modern data applications for banking data scientists may be built internally or purchased “off the shelf” from third parties. These new applications are powerful and fast enough to detect previously invisible patterns in massive volumes of real-time data. They also enable banks to proactively identify risks with models based on petabytes of historical data. These data science apps comb through the “haystacks” of data to identify subtle “needles” of fraud or risk that are difficult to find through manual inspection.
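As one toy illustration of the “needles in haystacks” idea (my own sketch, not any bank’s production fraud model), even a simple statistical screen can surface transactions that deviate sharply from an account’s historical behavior:

```python
# Toy anomaly screen: flag transactions far outside historical norms.
# A real fraud model would use far richer features and learned models;
# this sketch only illustrates the "needle in a haystack" idea.
from statistics import mean, stdev

def flag_outliers(history, new_transactions, threshold=3.0):
    """Return transactions more than `threshold` standard deviations
    from the historical mean amount."""
    mu = mean(history)
    sigma = stdev(history)
    return [t for t in new_transactions if abs(t - mu) > threshold * sigma]

# Hypothetical account history and incoming transactions (amounts in dollars)
history = [52.0, 48.5, 60.0, 45.0, 55.0, 50.5, 49.0, 58.0]
incoming = [51.0, 4900.0, 47.5]
print(flag_outliers(history, incoming))  # prints [4900.0]
```

Production systems replace the hand-set threshold with models trained on petabytes of history, but the principle is the same: quantify “normal” and flag what falls outside it.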
These modern data applications make data science ubiquitous. Rather than remaining back-shelf tools for the occasional suspicious transaction or period of market volatility, they can help financial firms incorporate data into every decision they make. They can automate data mining and predictive modeling for daily use, weaving advanced statistical analysis, machine learning, and artificial intelligence into the bank’s day-to-day operations.
In my next post in this series, I will examine specific data science use cases across three different banking workloads.
More On Data Science and Hortonworks for Financial Services: