This is another great European customer guest blog post authored by Joan Viladrosa, Tech Lead & Senior Big Data Engineer at Billy Mobile. You can hear more about their solution by joining our live webcast February 23rd. Register here.
About Billy Mobile
As a mobile ad exchange with a large marketplace of direct publishers and advertisers working only on a performance basis, Billy Mobile is a leading mobile marketing company. Focused on connecting local and international Advertisers, Agencies and Media Representatives with high-quality users, we pride ourselves on delivering the best advertising optimisation. We offer excellent service to advertisers and increased profit to mobile Ad Networks, Webmasters and Developers. Our aim is to connect local mobile markets to the world, and to do so we work with more than 500 advertising Agencies, Ad Networks, SSPs, DSPs and Ad Exchanges from over 30 countries.
Our Data Challenge:
In mobile advertising, showing the right advertisement to the right user at the right time isn’t enough. It needs to be done in a few milliseconds. If an ad doesn’t appear immediately, the viewer’s attention drifts, hurting the chance of a conversion. Speed becomes even more vital when programmatic buying and real-time bidding are involved. This speed must hold while handling large numbers of requests per second, not just to display an ad, but also to analyze performance and learn from viewers’ behaviour, so as to improve our prediction algorithm and in turn benefit our publishers and advertisers.
Every day we receive hundreds of millions of requests, resulting in hundreds of terabytes of raw data a month. All this data has to be collected, processed and stored in a way that lets us use it for many different purposes: improving our artificial intelligence, analyzing what’s happening under the hood by our Data Science, Business Intelligence and Sales teams, and reporting to both our Publishers and Advertisers.
We need to keep all of our data, petabytes of it, for years, in order to serve our data science team as well as our customers. Creating a cost-effective, reliable and useful archive of this huge amount of data leads us inevitably to Hadoop.
An open source platform solution
To talk about Hadoop and Open Source is to talk about Hortonworks Data Platform, the only 100% pure open source distribution. Our solution relies on almost every top-level Apache project in the Hadoop ecosystem.
Data Ingestion is accomplished through Apache Kafka, the distributed commit log solution par excellence. Apache Storm comes next for real time processing of our unbounded streams of data. Apache Hive is our data warehouse, where everything is stored for good. Also, we have an entity-cache in Apache HBase, and of course Apache Spark for analyzing the data and computing the prediction model of our system. The fine-grained access control and centralized audit mechanism provided by Apache Ranger, and the single point of access by Apache Knox allows us to keep data secure for each of our publishers and advertisers.
Some other companies rely on Hadoop for a small part of their business, or for side analytics features, but here at Billy Mobile our core systems and features fully depend on Hadoop. We fully trust and believe in the Hadoop ecosystem.
The advanced AdExchange platform architecture details
The AdExchange front-end servers receive all the traffic; impressions, clicks, conversions and other events are serialized into Avro and pushed to their respective Kafka topics. Kafka gives us a distributed, scalable and reliable pipeline. At the moment, all our front-end servers are in the same data center as the Hadoop cluster, so they send the data directly to Kafka.
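The ingestion flow above boils down to routing each event type to its own topic and serializing the payload. Here is a minimal sketch of that routing logic; the topic names and event fields are illustrative assumptions, and JSON stands in for Avro so the example stays self-contained:

```python
import json

# Hypothetical event-type -> Kafka topic mapping, mirroring the front-end flow
# described above. Topic names are assumptions, not our production names.
TOPICS = {
    "impression": "ads.impressions",
    "click": "ads.clicks",
    "conversion": "ads.conversions",
}

def route_event(event: dict) -> tuple:
    """Return the (topic, payload) pair a front-end server would push to Kafka.

    In production the payload is Avro-encoded; JSON stands in here so the
    sketch has no external dependencies.
    """
    topic = TOPICS[event["type"]]
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return topic, payload

topic, payload = route_event({"type": "impression", "ad_id": 42, "user": "u1"})
```

In production the `(topic, payload)` pair would be handed to a Kafka producer; keeping serialization separate from transport makes it easy to replay or inspect events.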
The next player is Apache Storm, our ETL engine, which reads from those Kafka topics and, after some basic calculations and joins (for example, matching a conversion with its impression), stores the data in Hive. Here we also use HBase as an entity cache for its low-latency key-value storage. We store each impression as soon as we get it, and retrieve it when a conversion, click or other event arrives, to use its information in the calculations.
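The entity-cache join can be sketched as follows: store each impression when it arrives, and enrich later events by looking the impression back up. A plain dict stands in for HBase here, and the field names are illustrative assumptions:

```python
# Sketch of the entity-cache join performed in the ETL: impressions are
# written on arrival, and clicks/conversions are enriched with the matching
# impression's data. A dict stands in for the HBase table.

class EntityCache:
    def __init__(self):
        self._store = {}  # impression_id -> impression record

    def put_impression(self, impression: dict) -> None:
        self._store[impression["impression_id"]] = impression

    def enrich(self, event: dict) -> dict:
        """Join a click/conversion with its impression before storing in Hive."""
        impression = self._store.get(event["impression_id"], {})
        # Event fields win on collision; impression fields fill in the rest.
        return {**impression, **event}

cache = EntityCache()
cache.put_impression({"impression_id": "i1", "offer": "offer-7", "country": "ES"})
enriched = cache.enrich({"impression_id": "i1", "type": "conversion", "payout": 1.2})
```

The real cache needs HBase's low-latency reads because the lookup sits on the hot path of every click and conversion.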
Traditionally, adding new data into Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition. This is essentially a “batch insertion”: insertion of new data into an existing partition is not permitted. We wanted our data to be available and ready to use as soon as possible, so this was not an option for us. The Hive Streaming API allows data to be pumped continuously into Hive. The incoming data can be committed in small batches of records into an existing Hive partition. Once data is committed, it becomes immediately visible to all subsequent Hive queries.
The streaming API is specifically designed for clients such as Storm, which continuously generate data. Streaming support is built on top of the ACID-based insert/update support in Hive, available since HDP 2.3. Although the streaming feature is still in beta, it works well for us.
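The micro-batching pattern described above can be sketched like this: buffer incoming records and commit each small batch as its own transaction, so data becomes queryable almost immediately. The `commit_fn` callback stands in for the actual Hive transaction-batch commit, and the batch size is an assumption:

```python
# Minimal sketch of micro-batch commits, as enabled by the Hive Streaming API:
# records are buffered and committed in small transactions to an open
# partition, rather than accumulated into a large periodic batch.

def stream_to_hive(records, commit_fn, batch_size=100):
    """Group a record stream into small batches and commit each one."""
    buffer = []
    committed = 0
    for record in records:
        buffer.append(record)
        if len(buffer) >= batch_size:
            commit_fn(buffer)
            committed += len(buffer)
            buffer = []
    if buffer:  # flush the final partial batch
        commit_fn(buffer)
        committed += len(buffer)
    return committed

batches = []
total = stream_to_hive(range(250), batches.append, batch_size=100)
```

The trade-off is the usual one: smaller batches mean fresher data but more transactions; we tune the batch size against the cluster's transaction overhead.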
Once the data is in our data warehouse, we can use it for analytics and for running our prediction algorithm. This is where Spark shines: with fast in-memory processing, we can process and update our prediction model every hour. Each execution reads more than 300 GB of data, including impressions, conversions, clicks and other events, and produces a prediction model that tells us the best offer for every possible targeting.
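At its core, that hourly job aggregates events per (targeting, offer) pair and keeps the best-converting offer for each targeting. Here is a toy, in-memory stand-in for that logic; in production it runs on Spark over hundreds of gigabytes, and the event schema here is an assumption:

```python
# Illustrative stand-in for the hourly model build: count impressions and
# conversions per (targeting, offer), then pick the offer with the highest
# conversion rate for each targeting.
from collections import defaultdict

def build_model(events):
    stats = defaultdict(lambda: [0, 0])  # (targeting, offer) -> [impressions, conversions]
    for e in events:
        key = (e["targeting"], e["offer"])
        if e["type"] == "impression":
            stats[key][0] += 1
        elif e["type"] == "conversion":
            stats[key][1] += 1
    best = {}  # targeting -> (offer, conversion_rate)
    for (targeting, offer), (imps, convs) in stats.items():
        rate = convs / imps if imps else 0.0
        if targeting not in best or rate > best[targeting][1]:
            best[targeting] = (offer, rate)
    return {t: offer for t, (offer, _rate) in best.items()}

events = (
    [{"type": "impression", "targeting": "ES-android", "offer": "A"}] * 10
    + [{"type": "conversion", "targeting": "ES-android", "offer": "A"}] * 1
    + [{"type": "impression", "targeting": "ES-android", "offer": "B"}] * 10
    + [{"type": "conversion", "targeting": "ES-android", "offer": "B"}] * 3
)
model = build_model(events)
```

The resulting targeting-to-offer map is what gets pushed to the serving side each hour.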
This model is then pushed to our Brain, a service which the AdExchange uses for choosing which offer to show given all of the user and HTTP request traits.
The data in Hive is also queried by our Admin Cluster, for reporting to our Publishers and Advertisers and for our Sales and BI teams. This panel also has a MariaDB cluster backend for all the information that is not yet Big Data.
Hardware and Services:
Our Front Cluster is responsible for receiving all the incoming traffic, so we take special care with its redundancy and reliability. We have two HAProxy servers (balanced with DNS round-robin) distributing the traffic across five front servers. Those machines have 16 cores, 96 GB of RAM and 240 GB of SSD storage. The storage in those front machines needs to be fast, but we don’t need a huge amount of space there.
Regarding the hardware of the HDP cluster, we chose the Dell R730xd, configured with 24 cores (48 threads with Hyper-Threading), 384 GB of RAM and 12 disks of 2 TB each. Our cluster is focused on computing, as most of our workload is the prediction-model calculation with Spark; this is why we chose to break the 1 CPU : 8 GB RAM : 1 HDD rule of thumb. Since most of our algorithm runs in-memory, we still get good I/O throughput from the HDDs with this configuration.
Monitoring and deploying:
For monitoring and deploying our Hadoop Cluster, we use Ambari, which simplifies the provisioning and management of the services.
We also use Zabbix for monitoring all the servers and plotting all the metrics.
For analyzing the performance of our own services (AdExchange, Brain, Postback and Callbacks…), we use statsD with Graphite and Grafana.
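One reason statsD fits well for instrumenting services like these is that its wire format is trivial: a metric is a short plain-text datagram over UDP. A minimal sketch, with illustrative metric names and the conventional default port as an assumption:

```python
import socket

def format_metric(name: str, value, metric_type: str) -> bytes:
    """Build a statsD datagram: '<name>:<value>|<type>' (c=counter, ms=timer, g=gauge)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")

def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Fire-and-forget a metric over UDP; a lost packet costs nothing."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_metric(name, value, metric_type), (host, port))
    finally:
        sock.close()

# A request counter and a response-time timer, as a service might emit them:
packet = format_metric("adexchange.requests", 1, "c")
timer_packet = format_metric("adexchange.response_time", 12, "ms")
```

Because UDP is fire-and-forget, instrumentation adds essentially no latency to the request path, which matters for a service answering in milliseconds.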
We have plans to migrate to a multi-data-center architecture, with some edge-location front-end servers to cut the response time for our users and improve conversion rates. For this, we are looking into solutions such as Hortonworks DataFlow (HDF), powered by Apache NiFi, to collect and transport data from a multitude of sources, whether always connected or intermittently available.
Also, in order to improve the interactivity of our Admin Panel queries against Hive, we are looking into Apache Kylin, a recent project providing fast OLAP cubes on top of Apache HBase. This way we expect to keep heavy queries from reaching Apache Hive and monopolizing our YARN cluster, improving response times.
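The idea behind an OLAP cube like Kylin's is to precompute aggregates for every combination of dimensions ahead of time, so a dashboard query becomes a key lookup instead of a scan over Hive. A toy sketch of that precomputation, with illustrative dimension and measure names:

```python
# Rough sketch of cuboid precomputation: SUM(measure) is materialized for
# every subset of the dimensions, so any group-by over those dimensions is
# answered by a lookup rather than a full table scan.
from itertools import combinations

def build_cube(rows, dimensions, measure):
    """Precompute SUM(measure) for every subset of dimensions (the cuboid lattice)."""
    cube = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] = cube.get(key, 0) + row[measure]
    return cube

rows = [
    {"country": "ES", "device": "android", "clicks": 5},
    {"country": "ES", "device": "ios", "clicks": 3},
    {"country": "FR", "device": "android", "clicks": 2},
]
cube = build_cube(rows, ["country", "device"], "clicks")

# A dashboard query is now a lookup, e.g. total clicks for country=ES:
es_clicks = cube[(("country",), ("ES",))]
```

The cost, of course, is cube build time and storage, which is why Kylin materializes the cuboids into HBase offline rather than computing them per query.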
For more information
Read the press release here.
Attend the free webinar with Billy Mobile here.
Read more about how Hortonworks helps advertisers here.