August 06, 2018

Distributed Pricing Engine using Dockerized Spark on YARN w/ HDP 3.0 [Part 1/4]

This is the first post in a four-part series in which we look at an architectural approach to implementing a distributed compute engine for pricing financial derivatives using Hortonworks Data Platform (HDP) 3.0.

In this post, we discuss the problem domain and set the context before zooming in on the functional and technical aspects.

Modern financial trading and risk platforms employ compute engines for pricing and risk analytics across asset classes to drive real-time trading decisions and quantitative risk management. Pricing financial instruments involves a range of algorithms: simple cashflow discounting, analytical methods built on stochastic-process models such as Black-Scholes, and computationally intensive numerical methods such as finite differences, Monte Carlo and quasi-Monte Carlo techniques. The choice depends on the instrument being priced (bonds, stocks, or derivatives such as options and swaps) and on the pricing metrics (NPV, rates etc.) and risk metrics (DV01, PV01, higher-order Greeks such as gamma, vega etc.) being calculated. Quantitative finance libraries, typically written in low-level languages such as C and C++, leverage efficient data structures and parallel programming constructs to exploit modern multi-core CPU and GPU architectures, and even specialized hardware such as FPGAs and ASICs, for high-performance computation of pricing and risk metrics.
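
To make the contrast between analytical and numerical methods concrete, here is a minimal sketch that prices the same European call option two ways: with the closed-form Black-Scholes formula and with a plain Monte Carlo simulation of the terminal stock price. It is an illustration only, not the engine discussed in this series; the function names, parameters, path count and seed are all assumptions made for the example, and it uses only the Python standard library rather than QuantLib.

```python
import math
import random


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Closed-form Black-Scholes price of a European call (no dividends)."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t

    def norm_cdf(x):
        # Standard normal CDF expressed through the error function.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def monte_carlo_call(spot, strike, rate, vol, maturity, n_paths=100_000, seed=42):
    """Monte Carlo estimate of the same call under geometric Brownian motion.

    n_paths and seed are illustrative defaults, not tuned values.
    """
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol ** 2) * maturity
    diffusion = vol * math.sqrt(maturity)
    payoff_sum = 0.0
    for _ in range(n_paths):
        # Simulate one risk-neutral terminal price and accumulate its payoff.
        terminal = spot * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        payoff_sum += max(terminal - strike, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-rate * maturity) * payoff_sum / n_paths
```

Calling both functions with the same inputs, for example `black_scholes_call(100, 100, 0.05, 0.2, 1.0)` and `monte_carlo_call(100, 100, 0.05, 0.2, 1.0)`, should give prices that agree to within the simulation's sampling error. The Monte Carlo version does far more work per price, which is why these methods dominate the compute budget when no closed form exists.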

Quantitative and regulatory risk management and reporting imperatives, such as valuation adjustment (XVA) calculations (CVA, DVA, FVA etc.), BCBS 239, FRTB, and CCAR and DFAST in the US or MiFID in Europe, necessitate valuing portfolios of millions of trades across tens of thousands of scenario simulations and aggregating the computed metrics across a vast number and combination of dimensions. This is a data-intensive distributed computing problem that can benefit from:

  • Distributed compute and data-parallel frameworks such as Apache Spark and Hadoop, which offer scale-out, shared-nothing and fault-tolerant architectures. They are more portable, have more approachable APIs, and focus on exploiting data locality on commodity hardware, rather than relying on high-speed interconnects between compute and storage on high-end hardware as HPC frameworks such as MPI and OpenMP do.
  • The elasticity and operational efficiencies of cloud computing, especially burst compute semantics for these use cases, augmented by OS virtualization through containers and lean DevOps practices.
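
The shape of that problem (value every trade under every scenario, then aggregate along some dimension) can be sketched in a few lines. The example below is a toy, single-process stand-in for the data-parallel pattern: work is split into independent partitions, each partition is valued without any shared state, and the partial results are merged. The trade and scenario fields, the `desk` dimension and the valuation rule are all hypothetical; in Spark the same three steps would roughly correspond to partitioning an RDD, `mapPartitions`, and `reduceByKey`.

```python
from collections import defaultdict
from itertools import product


def partition(items, n_partitions):
    """Split work into independent chunks, one per (hypothetical) executor."""
    return [items[i::n_partitions] for i in range(n_partitions)]


def price_partition(tasks):
    """Value each (trade, scenario) pair; needs no state from other partitions."""
    partial = defaultdict(float)
    for trade, scenario in tasks:
        # Toy valuation rule: notional scaled by the scenario's market shock.
        value = trade["notional"] * (1.0 + scenario["shock"])
        partial[(trade["desk"], scenario["name"])] += value
    return partial


def merge(partials):
    """Combine per-partition aggregates into one result keyed by (desk, scenario)."""
    total = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


# Hypothetical sample data: three trades on two desks, two scenarios.
trades = [
    {"id": 1, "desk": "rates", "notional": 100.0},
    {"id": 2, "desk": "rates", "notional": 200.0},
    {"id": 3, "desk": "equity", "notional": 50.0},
]
scenarios = [{"name": "base", "shock": 0.0}, {"name": "down", "shock": -0.5}]

tasks = list(product(trades, scenarios))  # every trade under every scenario
result = merge(price_partition(p) for p in partition(tasks, n_partitions=4))
```

Because `price_partition` touches only its own inputs, partitions can run on different machines and be retried independently after a failure, which is what makes the shared-nothing model a good fit for this workload.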

In part 2 of this four-part series, we will look at representative pricing semantics and the technical architecture, and capture the essence of this problem space through a trivial implementation of the compute engine. The implementation combines parallel programming using QuantLib, an open source library for quantitative finance, with the distributed computing framework Apache Spark, running in an OS-virtualized environment through Docker containers on Apache Hadoop YARN, which serves as the resource scheduler and distributed data operating system. The whole stack is provisioned, orchestrated and managed in an OpenStack private cloud through Hortonworks Cloudbreak, all through a single platform in the form of HDP 3.0!
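
As a rough preview of what "Spark in Docker containers on YARN" looks like from the submitting side, the sketch below assembles a `spark-submit` command that asks YARN to launch both the application master and the executors inside a Docker image. It assumes a cluster where the YARN Docker container runtime is already enabled and the image is allowed; the helper name, image name and application path are hypothetical, and the exact settings used later in this series may differ.

```python
def dockerized_spark_submit(image, app, master="yarn", deploy_mode="cluster"):
    """Build a spark-submit command that requests Docker containers from YARN.

    The image and app arguments are placeholders supplied by the caller.
    """
    # Environment variables read by YARN's Docker container runtime.
    runtime_env = {
        "YARN_CONTAINER_RUNTIME_TYPE": "docker",
        "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
    }
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for name, value in runtime_env.items():
        # Set the variable for the application master and for every executor.
        cmd += ["--conf", f"spark.yarn.appMasterEnv.{name}={value}"]
        cmd += ["--conf", f"spark.executorEnv.{name}={value}"]
    cmd.append(app)
    return cmd


# Hypothetical image and application names, for illustration only.
command = dockerized_spark_submit("example/quantlib-spark:latest", "pricing_app.py")
print(" ".join(command))
```

The point of the design is that the native QuantLib build and its dependencies live in the image, so nothing has to be installed on the worker nodes themselves.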



Richard says:

Care to share why you would run Spark in containers on YARN rather than just running Spark on YARN, as has been possible for several years now? Is there any benefit at all? Performance? Multi version? It sounds like a lot of complexity, and the fact that you need 4 blogs to spell it all out seems to confirm that.

Kevin K. says:

The goal here is to be able to run any kind of development work that depends on specific libraries directly, instead of having to install and configure all of your nodes just to execute a simple Python or R job.
You can see more on this link:
And that’s actually one of the biggest problems with Hadoop clusters right now (for example, managing the Python versions and packages installed on each node).

On the complexity part, yes, it’s complex, but not as complex as having to set up each and every node for each and every development made on the platform by each and every user.

Hope that it’s a good enough reason for you 🙂

Simon says:

I think the point here is that the QuantLib library being used is implemented in C++, with a JNI wrapper to make it callable from Java. Bundling a pure-Java library with a Spark application is easy, and a container would probably not be needed. However, the C++ library has dependencies on other libraries, e.g. libc. And in a YARN cluster, you have no guarantee that the nodes have identical native libraries, or that those libraries match your development environment. Using a container means that there is a consistent set of libs for native code.
The same problem occurs when running Python code that calls into native libraries (which is quite common): you can bundle the Python code, but bundling the underlying native libs is not a very stable approach. Requiring the sysadmins to install a specific set of Python libs on each worker node in the cluster is also not very scalable.
