Viewing posts by: Vadim Vaks

Apache Spark has ignited an explosion of data exploration on very large data sets. Spark played a big role in making general purpose distributed compute accessible. Anyone with some level of skill in Python, Scala, Java, and now R, can just sit down and start exploring data at scale. It also democratized Data Science by […]

It’s never been easier to get started with Apache Hadoop. The Hortonworks Sandbox combines 100% open-source Apache Hadoop and its data access engines (Apache Spark, Apache Hive, Apache HBase, Apache Solr, Apache Pig) with enterprise-grade Operations (Apache Ambari), Security (Apache Ranger and Apache Knox) and Governance (Apache Atlas). The Sandbox also provides tools for DevOps, […]

You may have seen our invite to join the genomics consortium. Let me recap what it is about and bring you up to speed on our progress and next steps. Hortonworks is quarterbacking a consortium of leading healthcare organizations and subject matter experts to help develop the platform requirements for next generation […]

Geospatial data is pervasive, found in mobile devices, sensors, logs, and wearables. This data’s spatial context is an important variable in many predictive analytics applications. To benefit from spatial context in a predictive analytics application, we need to be able to parse geospatial datasets at scale, join them with target datasets that contain point-in-space information, […]

Drink from Elephant’s Well Of Knowledge

Developer success starts with open and reusable code, and a community that allows for both consumption of code and contribution of updates to the code base. This success engenders a thriving and evolving community. To that end, today we are announcing the Hortonworks Gallery for developers. Located on GitHub, the […]

Apache Spark provides a lot of valuable tools for data science. With our release of the Apache Spark 1.3.1 Technical Preview, the powerful DataFrame API is available on HDP. Data scientists use data exploration and visualization to help frame the question and fine-tune the learning. Apache Zeppelin helps with this. Based on the concept […]

Introduction

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, and Python that allow data workers to efficiently execute machine learning algorithms that require fast iterative access to datasets. Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN-enabled workloads in the […]

Hortonworks is pleased to announce the general availability of Apache Spark in Hortonworks Data Platform (HDP), now available on our downloads page. With HDP 2.2.4, Hortonworks now offers support for your developers and data scientists using Apache Spark 1.2.1. HDP’s YARN-based architecture enables multiple applications to share a common cluster and dataset while ensuring consistent […]

Hortonworks is excited to announce that our first hands-on, performance-based certification exam is now available! The HDP Certified Developer (HDPCD) exam is designed for Hadoop developers working with frameworks like Pig, Hive, Sqoop and Flume. This new approach to Hadoop certification gives individuals an opportunity to prove their Hadoop skills in […]

This three-part series is co-authored by Ofer Mendelevitch, director of data science at Hortonworks, and Jiwon Seo, Ph.D. and research assistant at Stanford University.

Introduction

This is the third part of the blog-post series about anomaly detection from healthcare data. In part 1, we described the dataset, the business use-case and our general approach […]

This three-part series is co-authored by Ofer Mendelevitch, director of data science at Hortonworks, and Jiwon Seo, Ph.D. and research assistant at Stanford University.

Introduction

This is the second part of our blog-post series about anomaly detection from healthcare data. As described in part 1, our goal is to apply the personalized-PageRank algorithm to […]

This three-part series is co-authored by Ofer Mendelevitch, director of data science at Hortonworks, and Jiwon Seo, Ph.D. and research assistant at Stanford University.

Introduction

PageRank [1] is the poster child of graph algorithms, used by Google in its original search engine system to determine which web pages are most influential. The incredible success of PageRank led […]

As a data scientist working with Hadoop, I often use Apache Hive to explore data, make ad-hoc queries or build data pipelines. Until recently, optimizing Hive queries focused mostly on data layout techniques such as partitioning and bucketing or using custom file formats. In the last couple of years, driven largely by the innovation of […]

In our series on Data Science and Hadoop, predicting airline delays, we demonstrated how to build predictive models with Apache Hadoop, using existing tools. In part 1, we employed Pig and Python; part 2 explored Spark, MLlib and Scala. Throughout the series, the thesis, theme, topic, and algorithms were similar. That is, we wanted to […]

Introduction

In this second part of our series on Data Science and Apache Hadoop, with its accompanying IPython Notebook, we continue to demonstrate how to build a predictive model with Apache Hadoop, using existing modeling tools. This time we’ll use Apache Spark and MLlib. Apache Spark is a relatively new […]