cta

Get Started

cloud

Ready to Get Started?

Download sandbox

How can we help you?

closeClose button

The Hortonworks Blog

More from Ofer Mendelevitch

This three part series is co-authored by Ofer Mendelevitch, director of data science at Hortonworks, and Jiwon Seo, Ph.D. and research assistant at Stanford University. Introduction This is the third part of the blog-post series about anomaly detection from healthcare data. In part 1, we described the dataset, the business use-case and our general approach […]

This three part series is co-authored by Ofer Mendelevitch, director of data science at Hortonworks, and Jiwon Seo, Ph.D. and research assistant at Stanford University. Introduction This is the second part of our blog-post series about anomaly detection from healthcare data. As described in part 1, our goal is to apply the personalized-PageRank algorithm to […]

This three part series is co-authored by Ofer Mendelevitch, director of data science at Hortonworks, and Jiwon Seo, Ph.D. and research assistant at Stanford University. Introduction PageRank[1]is the poster-child of graph algorithms, used by Google in its original search engine system to determine which web pages are most influential. The incredible success of PageRank led […]

As a data scientist working with Hadoop, I often use Apache Hive to explore data, make ad-hoc queries or build data pipelines. Until recently, optimizing Hive queries focused mostly on data layout techniques such as partitioning and bucketing or using custom file formats. In the last couple of years, driven largely by the innovation of […]

In our series on Data Science and Hadoop, predicting airline delays, we demonstrated how to build predictive models with Apache Hadoop, using existing tools. In part 1, we employed Pig and Python; part 2 explored Spark, ML-Lib and Scala. Throughout the series, the thesis, theme, topic, and algorithms were similar. That is, we wanted to […]

Introduction In this 2nd part of the blog post and its accompanying IPython Notebook in our series on Data Science and Apache Hadoop, we continue to demonstrate how to build a predictive model with Apache Hadoop, using existing modeling tools. And this time we’ll use Apache Spark and ML-Lib. Apache Spark is a relatively new […]

Introduction With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets. This derivation of business value is possible because Apache Hadoop YARN as the architectural center of Modern Data Architecture (MDA) allows purpose-built data engines such as Apache Tez and […]

A lot of people ask me: how do I become a data scientist? I think the short answer is: as with any technical role, it isn’t necessarily easy or quick, but if you’re smart, committed and willing to invest in learning and experimentation, then of course you can do it. In a previous post, I described […]

In a recent blog post I mentioned the 4 reasons for using Hadoop for data science. In this blog post I would like to dive deeper into the last of these reasons: data agility. In most existing data architectures, based on relational database systems, the data schema is of central importance, and needs to be […]

Data scientists are in high demand these days. Everyone seems to be hiring a team of data scientists, yet many are still not quite sure what data science is all about, and what skill set they need to look for in a data scientist to build a stellar Hadoop data science team. We at Hortonworks […]

Over the last 10 years or so, large web companies such as Google, Yahoo!, Amazon and Facebook have successfully applied large scale machine learning algorithms over big data sets, creating innovative data products such as online advertising systems and recommendation engines. Apache Hadoop is quickly becoming a central store for big data in the enterprise, […]