The value of any data is proportional to the insights derived from it. With the Data Lake Architecture, all enterprise data is made available in one place. Apache Spark & Apache Zeppelin are the keys to deriving insights from the Data Lake, and both are essential tools for Predictive Analytics and Machine Learning. The latest release of HDP delivers several features and improvements to Spark & Zeppelin that advance both.
Apache Spark 2.1.1 is now GA with HDP 2.6. This is the most stable and feature-rich release on the Spark 2 code line. The primary focus of the Spark 2.1 release is on Structured Streaming, Machine Learning, and SparkR. Spark Streaming now works with the Apache Kafka 0.10 release and can connect to Kafka over SSL. Structured Streaming continues to mature, but it is still Alpha, and we don’t recommend using it in mission-critical production applications yet. You can try out Spark 2.1 quickly in Hortonworks Data Cloud.
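As a rough illustration of the Kafka-over-SSL setup mentioned above, the sketch below builds the option map that a Structured Streaming Kafka source could be given from PySpark. The helper name `kafka_ssl_options`, the broker address, the topic, and the truststore path and password are all placeholders introduced here for illustration; keys prefixed with `kafka.` are passed through to the Kafka consumer.

```python
def kafka_ssl_options(brokers, topic, truststore_path, truststore_password):
    """Return source options for reading a Kafka topic over SSL.

    Hypothetical helper for illustration only. Keys prefixed with "kafka."
    are handed to the underlying Kafka consumer by Spark's Kafka source.
    """
    return {
        "kafka.bootstrap.servers": brokers,
        "subscribe": topic,
        "kafka.security.protocol": "SSL",
        "kafka.ssl.truststore.location": truststore_path,
        "kafka.ssl.truststore.password": truststore_password,
    }


# Placeholder values; substitute the ones for your own cluster.
options = kafka_ssl_options(
    brokers="broker1.example.com:9093",
    topic="events",
    truststore_path="/etc/security/kafka.client.truststore.jks",
    truststore_password="changeit",
)

# With a SparkSession named `spark` and the Kafka source package available,
# the options could be applied like this:
#   stream = spark.readStream.format("kafka").options(**options).load()
```

Because Structured Streaming is still Alpha in this release, a snippet like this is best treated as something to experiment with rather than a production pattern.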
Most data scientists use R & Python, and with SparkR & PySpark respectively they can continue to build on their familiarity with those languages. However, they need to use the Spark API to apply machine learning with Spark and to take advantage of distributed computation. Both SparkR & PySpark are evolving rapidly, and SparkR now supports a number of machine learning algorithms such as LDA, ALS, RF, GMM, and GBT. Another key improvement in SparkR is the ability to deploy a package interactively. This helps data scientists deploy their favorite R packages in their own environment without stepping on other users.
PySpark now also supports virtualenv, which allows PySpark users to deploy their libraries in their own isolated environments.
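To make the virtualenv support a little more concrete, here is a minimal sketch that assembles a `spark-submit` command line with virtualenv enabled. The `spark.pyspark.virtualenv.*` property names are written as we recall them for HDP and should be treated as assumptions to verify against the HDP documentation; the helper `virtualenv_submit_command` and all file paths are hypothetical.

```python
import shlex


def virtualenv_submit_command(app, requirements, virtualenv_bin, env_type="native"):
    """Build a spark-submit argument list that enables a per-job virtualenv.

    Hypothetical helper; the property names below are assumptions to check
    against the documentation for your HDP version.
    """
    conf = {
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": env_type,
        "spark.pyspark.virtualenv.requirements": requirements,
        "spark.pyspark.virtualenv.bin.path": virtualenv_bin,
    }
    command = ["spark-submit", "--master", "yarn"]
    for key, value in conf.items():
        command += ["--conf", "{}={}".format(key, value)]
    command.append(app)
    return command


# Placeholder paths; substitute your own script and requirements file.
command = virtualenv_submit_command(
    app="my_job.py",
    requirements="/home/alice/requirements.txt",
    virtualenv_bin="/usr/bin/virtualenv",
)
print(" ".join(shlex.quote(part) for part in command))
```

Each user points at a separate requirements file, so two users can depend on different versions of the same library without affecting each other.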
Perhaps the most critical feature in this Spark release is Spark’s integration with LLAP & Ranger. This integration delivers fine-grained access control to SparkSQL. Security administrators can now specify row- and column-level access control and masking for SparkSQL, giving it the same fine-grained access control that Apache Hive users already have.
With the HDP 2.6 release we have delivered Livy to provide REST-based access to Spark. REST-based access is useful for large enterprises that want to give Spark users remote access without having to open up their cluster. REST access also removes the need to deal with Kerberos.
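To give a feel for what REST-based access looks like, the sketch below builds the two Livy requests a remote client would typically make: one to create an interactive session and one to run a statement in it. The host name, the port, and the session id `0` are placeholders, and `livy_request` is a hypothetical helper. The requests are only constructed here, not sent.

```python
import json
import urllib.request

# Placeholder endpoint; the host and port depend on your Livy installation.
LIVY_URL = "http://livy.example.com:8998"


def livy_request(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# 1. Ask Livy to start an interactive PySpark session.
create_session = livy_request("/sessions", {"kind": "pyspark"})

# 2. Submit code to the session. The id 0 is a placeholder; the real id is
#    returned in the response to step 1.
run_statement = livy_request(
    "/sessions/0/statements",
    {"code": "sc.parallelize(range(100)).sum()"},
)

# To actually send a request against a live Livy server:
#   response = urllib.request.urlopen(create_session)
```

The client only needs HTTP access to the Livy endpoint; it does not need Spark client libraries or direct access to the cluster nodes.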
Spark jobs often interact with other HDP components; for example, they read from HDFS and run on YARN. Tracing calls across these components is difficult, and it is hard to correlate actions. With this release, we have provided a way to correlate actions across components, which makes debugging complex Spark jobs easier.
Apache Zeppelin 0.7
This release of HDP delivers version 0.7.1 of Apache Zeppelin. The key improvement in Zeppelin 0.7 is support for Apache Spark 2.1. Another big improvement is in Zeppelin’s integration with Livy. With this release, Zeppelin’s Livy interpreter detects expired sessions automatically and no longer needs to be restarted when a Livy session expires. Another key improvement is support for multi-line SQL statements in the JDBC interpreter.
HDP 2.6 is a major release for Apache Spark & Zeppelin that introduces a number of key new features. Please try the latest release; we look forward to your feedback so that we can continue to improve.
This release would not have been possible without the tremendous help of the Apache communities, our customers, and users of Spark & Zeppelin. We sincerely thank you all. We are very excited about the future of Spark & Zeppelin; the best is yet to come.