PySpark, IPython Notebook, and Spark SQL as an Environment for Data Science

Abstract: Data science on Hadoop can be a daunting journey because it generally spans multiple tools and interfaces. Furthermore, although people are doing data science on Hadoop, worked examples are few and far between.

As part of the Social Security Act, the Centers for Medicare & Medicaid Services has begun to publish data detailing the relationship between physicians and medical institutions. This data has been analyzed cursorily in the press, but an in-depth outlier and Benford’s law analysis hasn’t been attempted (to my knowledge).

Casey will present a demo that uses Spark and Hive to perform the above analysis without leaving the IPython Notebook.

Speaker: Casey Stella is a Principal Architect at Hortonworks and focuses on issues around data science, especially natural language processing at scale. He has domain knowledge in medical/clinical informatics and in oil/gas data analysis and signal processing at scale.

Thursday, May 28, 2015
OWS-150 (Owens Science Hall), University of St. Thomas, 2115 Summit Avenue, Saint Paul, MN