Insights from DataWeek: San Francisco

I spent some time at the first-ever DataWeek in San Francisco last week.  It is a brand-new show, and it was very well run, spread across a few cool spaces, with an interesting mix of novice to experienced data professionals.  There was a good blend of labs, speakers, panels and great networking opportunities.  In all, it was great, and a big thanks and kudos to the organizers.

I took part in a panel and also presented a three-hour overview of Hadoop.  There were some good questions thrown at the panel, but more interesting was the discussion across the three sessions.  Before each presentation, I ran an informal survey of the room to get a sense of the audience; there was an even mix of complete novices, those new to Hadoop and experienced practitioners.

Each session had lively discussion and great engagement.  There were three segments to the presentation: a Hadoop market overview, an introduction to Hadoop, and Hadoop usage patterns.  I would also say that, in general, there were three key points the audience really seemed to focus on.

Forest/Trees :: Distribution/Project
There are Hadoop distributions, and there is the Apache Hadoop project.  When you are new to this world and learning through all the media, you can get lost in this terminology, and the clarification of this point seemed important to some of the DataWeek crowd.

The conversation went a little like this… the Apache Hadoop project comprises MapReduce and HDFS.  Sometimes we refer to this as “core Hadoop,” as it is the central focus of a Hadoop project. It provides redundant, reliable storage (HDFS) and distributed processing, or compute (MapReduce). In order for Hadoop, the project, to become a more complete data platform, we, the community, have created several related projects that make Hadoop more useful and dependable. When we package these projects (Hive, HBase, Pig, HCatalog, Ambari, ZooKeeper, Oozie, etc.) with core Hadoop, the result is a “distribution.”
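
To make “core Hadoop” concrete, here is the canonical WordCount job, written against the standard MapReduce Java API (this sketch assumes a recent Hadoop release with Job.getInstance): the input and output paths live on HDFS, the storage half, while the mapper and reducer run distributed across the cluster, the compute half.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input read from HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output written to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```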

Distributions came about because each project has its own release cycle, and getting compatible versions together is sometimes difficult.  A distribution also packages the projects and provides an installer to make deployment much easier.

Insatiable Thirst for Use Cases
Design Patterns by Gamma et al. has been and always will be one of the best developer books ever written. I like design patterns because they take a lot of data and boil it down to naturally occurring patterns.  They make sense of chaos.

In the third hour of our overview, we presented some reusable patterns of use for Hadoop, namely Refine, Explore and Enrich.  With refine, we apply a known process to a set of big data to extract results and use them in a business process.  With explore, we use Hadoop to discover new information that was not attainable before.  Often with explore, we will operationalize findings for use in the refine pattern.  Finally, with enrich, we use big data to supplement and improve the user experience of an online application.
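
As a rough illustration of the refine pattern (not the exact material from the session), consider a job that scans raw clickstream logs on HDFS, keeps only purchase events, and distills them into per-product revenue totals that could feed a downstream business process.  The record layout and field names here are entirely hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

// Hypothetical "refine" job: filter raw tab-separated clickstream logs down to
// purchase events and emit total revenue (in cents) per product.
public class RefineRevenue {

  public static class PurchaseMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Text product = new Text();
    private final LongWritable cents = new LongWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumed record layout: timestamp \t eventType \t productId \t amountCents
      // (a real job would also guard against malformed numeric fields)
      String[] fields = value.toString().split("\t");
      if (fields.length == 4 && "purchase".equals(fields[1])) {
        product.set(fields[2]);
        cents.set(Long.parseLong(fields[3]));
        context.write(product, cents);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "refine revenue");
    job.setJarByClass(RefineRevenue.class);
    job.setMapperClass(PurchaseMapper.class);
    // LongSumReducer ships with Hadoop and simply sums the values per key
    job.setCombinerClass(LongSumReducer.class);
    job.setReducerClass(LongSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The refined output is small and structured, which is the point of the pattern: the heavy lifting happens in Hadoop, and only the distilled results move on to the business process.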

This session was scheduled for 45 minutes and ran the full hour and beyond.  There were a LOT of questions and interactions.  The material was well received by the experienced professionals, as it made sense of their projects, and for those new to Hadoop it provided a good sense of where to start and how to approach this big data thing.

We Face Challenges
It seemed everyone wanted to get started but was presented with challenges.  There were really three areas of focus in this discussion: acquiring skills, managing a cluster and building a business case. The business case and validation of a project was an interesting topic, as some said you should just start with a project and run with it, while others advocated careful planning and a formal process. I guess in the end both sides were right.

It really depends on your org and what it can stomach. I will add my two cents, however…  Hadoop is open source and available to you today, so use it and start addressing all three of these challenges in the immediate future.

As noted, DataWeek was a huge success, and I am honored to have taken part in what surely will become a regular event.  Congrats to the organizers on the birth of a new show.
