Data science holds tremendous potential for organizations to uncover new insights and new drivers of revenue and profitability. Big Data has brought enterprises the promise of doing data science at scale, but that promise comes with challenges: data scientists must continuously learn and collaborate. They have many tools at their disposal, including notebooks such as Jupyter and Apache Zeppelin, IDEs such as RStudio, languages such as R, Python and Scala, and frameworks such as Apache Spark. Given all these choices, how do you best collaborate to build your model and then work through the development lifecycle to deploy it from test into production?
Why Data Science on Big Data?
In this meetup we will cover the attributes of a modern data science platform that empowers data scientists to build models using all the data in their data lake and fosters continuous learning and collaboration. We will show a demo of Apache Zeppelin, Apache Spark, Apache Livy and Apache Hadoop, with a focus on integration, security, and model deployment and management.
Data Science at Scale Demo
The demo will cover the data science life cycle: developing a model in a team environment, training the model with all the data on a Hadoop cluster, and deploying the model into production. The model will be a Spark ML model.
Practical ML with Apache Spark
To deliver machine learning solutions, data scientists need not only to fit models but also to handle familiar tasks such as data collection and wrangling, labelling, feature extraction and transformation, and model tuning and evaluation. Apache Spark provides a unified solution for all of this under the same framework.
For example, one can use Spark SQL to generate training data from different sources and then pass it directly to MLlib for feature engineering and model tuning, instead of using Hive/Pig for the first half and then downloading the data to a single machine to train models in R. The latter is actually very common in practice but painful to maintain. Spark MLlib makes life easier for data scientists and machine learning engineers so that they can focus on building better ML models and applications.
We will discuss the underlying principles required to develop practical machine learning and data science pipelines and share hands-on experience using Apache Spark to solve typical machine learning and data science problems. We will also have a short discussion about the challenges Spark MLlib faces from other machine learning libraries such as TensorFlow and XGBoost.
6:30 – 7:00 PM – Networking and Pizza
7:00 – 7:20 PM – Why Data Science on Big Data?
Hellmar Becker – Solutions Engineer, Hortonworks
7:20 – 7:50 PM – Data Science at Scale Demo
Marc Decker – Analytics Sales Engineer, IBM
7:50 – 8:15 PM – 3rd talk – to be announced
8:15 – 9:00 PM – Drinks and Networking