Category Archives: Industry Happenings


UC Irvine Health: Improving Quality of Care with Apache Hadoop (Part 2)

This is the second part of a series written by Charles Boicey from UC Irvine Health (part 1 is here). The series will demonstrate a real case study for Apache Hadoop in healthcare and also journal the architecture and technical considerations presented during implementation.

It has been 232 days since the last post. Much has transpired including a rebranding of the organization from UCI Medical Center to UC Irvine Health. I am happy to report we have a production Saritor environment up and running on the Hortonworks Data Platform.

Here are some highlights from the past 232 days:

Home Monitoring

In collaboration with our medical device integration partner, iSirona, we are developing a system to acquire home monitoring data and transmit it to Saritor. Our first deployed device will be a scale. This may sound simple, but in-home monitoring of the daily weights of congestive heart failure patients is essential for preventing their readmission to the hospital.

Home monitoring data will not be transmitted directly to the Electronic Medical Record (EMR), for a very specific reason. Home device data from thousands of patients transmitted directly to the EMR would be a nightmare for clinicians to manage. It would be too much data. By sending the data to Saritor first, an algorithm can determine which changes in weight indicate risk of readmission and then notify clinicians about those cases. All home monitoring data will be viewable in the EMR via an API to Saritor.
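To make the idea of filtering before notifying concrete, here is a minimal sketch of a weight-change rule. It is not the Saritor algorithm: the function name and both thresholds (2 lb in a day, 5 lb in a week) are assumptions chosen for illustration only.

```python
from datetime import date

# Hypothetical thresholds for illustration; not taken from the Saritor deployment.
DAILY_GAIN_LB = 2.0
WEEKLY_GAIN_LB = 5.0


def flag_weight_gain(readings):
    """Return the dates whose weight gain should trigger a clinician notification.

    readings: list of (date, weight_lb) tuples sorted by date.
    """
    alerts = []
    for i, (day, weight) in enumerate(readings):
        # Compare this reading with every earlier one in the series.
        for prev_day, prev_weight in readings[:i]:
            gap = (day - prev_day).days
            gain = weight - prev_weight
            rapid = gap <= 1 and gain >= DAILY_GAIN_LB
            sustained = gap <= 7 and gain >= WEEKLY_GAIN_LB
            if rapid or sustained:
                alerts.append(day)
                break  # one alert per day is enough
    return alerts
```

Only the flagged dates would be pushed to clinicians, while every reading stays available for lookup.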

In-Hospital Monitoring

We are working on a pilot to enhance patient monitoring in the hospital. In California, nurses typically have up to five patients to care for, and it can be challenging to be with a patient at the bedside and also keep a close eye on all the small changes in vitals across all patients.

Soon, hospitals will be able to provide each new inpatient with a wearable disposable patch that monitors vital signs such as heart rate, temperature and pulse oximetry, and wirelessly transmits that data every minute to Saritor. An algorithm can “watch” that data for patterns that the nursing team might not be able to catch. Because nurses cannot watch a monitor for every minute of their shift, Saritor has “got their back”. Nurses can go about the business of caring for patients and Saritor will notify them when there is a disturbing pattern in a patient’s vitals. A data warehouse might be able to run a similar algorithm, but with 24-hour latency. That’s too much latency for a nurse to respond quickly to an emergent situation.
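As a rough illustration of what "watching" a minute-by-minute stream could look like, the sketch below keeps a sliding window of recent heart-rate readings and raises a flag only when the whole window is above a limit. The class name, window size and limit are hypothetical, and a real surveillance algorithm would be considerably richer.

```python
from collections import deque


class VitalsWatcher:
    """Flag a sustained high heart rate over a sliding window of readings."""

    def __init__(self, window=5, high_hr=120):
        # Both defaults are illustrative assumptions, not clinical guidance.
        self.window = window
        self.high_hr = high_hr
        self.recent = deque(maxlen=window)  # oldest reading drops off automatically

    def observe(self, heart_rate):
        """Record one reading; return True if every reading in a full window is high."""
        self.recent.append(heart_rate)
        window_full = len(self.recent) == self.window
        return window_full and all(hr > self.high_hr for hr in self.recent)
```

Requiring a full window of high readings avoids paging a nurse for a single noisy sample.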

Patient Self-Monitoring

With the increasing number of patients joining the “Quantified Self” movement, we see Saritor as the ideal environment to help receive more health data generated by the patients themselves. We want to store and make use of patient-generated data from personal health records and home monitoring. Sites such as Fitbit, 23andMe and others could also feed in data. With open APIs to a patient’s personal health record, this data can be ingested into Saritor and then be made available to clinicians via the EMR. Score cards from the EMR data in Saritor can also be pushed back out to the patients.

Other Lessons We’ve Learned

Hadoop Plays Well with Others

One awesome discovery we made was that the Hadoop ecosystem plays well with other systems. We were able to start ingesting data into Hadoop without having to change anything within the current IT environment. For example, all of the healthcare data ingested into Saritor goes into HDFS. For the monitoring of inpatients, MapReduce jobs run against HDFS and then push that data into MongoDB. Algorithms in Mahout run against the data in MongoDB and can push notifications to the EMR via an event engine.
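For readers new to the pattern, here is a small in-memory sketch of the map and reduce steps such a job might perform before its output is loaded into a document store. The record layout (`patient_id,vital,value`) and all function names are assumptions; the real jobs run on Hadoop against HDFS, not in a single Python process.

```python
from collections import defaultdict


def map_reading(line):
    """Map step: parse 'patient_id,vital,value' and emit ((patient_id, vital), value)."""
    patient_id, vital, value = line.strip().split(",")
    yield (patient_id, vital), float(value)


def reduce_readings(key, values):
    """Reduce step: summarize all values for one (patient, vital) pair as a document."""
    patient_id, vital = key
    values = list(values)
    return {
        "patient_id": patient_id,
        "vital": vital,
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


def run_job(lines):
    """Simulate the shuffle phase locally by grouping mapped values by key."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_reading(line):
            groups[key].append(value)
    return [reduce_readings(key, groups[key]) for key in sorted(groups)]
```

Each returned dictionary has the shape of a document that could then be inserted into a store such as MongoDB.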

For graph analysis of healthcare data, MapReduce jobs run against HDFS and then output in graph form for input into Neo4j.
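"Graph form" usually means a list of nodes and relationships. The sketch below turns hypothetical (patient, provider) encounter pairs into weighted edges and writes them as CSV. The relationship name `SEEN_BY`, the column headers and the function names are invented for illustration and are not the schema used in Saritor.

```python
import csv
import io
from collections import Counter


def build_edges(encounters):
    """Collapse (patient_id, provider_id) pairs into weighted relationship rows."""
    counts = Counter(encounters)
    return [(patient, "SEEN_BY", provider, visits)
            for (patient, provider), visits in sorted(counts.items())]


def edges_to_csv(edges):
    """Serialize edges as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["start", "type", "end", "visits"])
    writer.writerows(edges)
    return buf.getvalue()
```

An edge list like this is a common interchange shape for bulk loading into a graph database.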

Legacy Healthcare Data Is Valuable

We ended up with 9 million patient records spanning 22 years and 1.2 million patients. Our original estimate was 3 million records. We are using this data to build our surveillance algorithms.

Social Media Is an Important New Source of Information

Saritor is capable of storing social media data related to UC Irvine Health and a UCI student project is underway to develop a sentiment analysis dashboard to better understand the social media environment external to UC Irvine Health. As part of the patient experience feedback loop we will be able to reach out and connect with patients to better understand their concerns so that we can enhance the patient experience.

Others in the Healthcare Community Are Interested in Adopting Hadoop

I’ve spoken with many other healthcare providers that are trying to solve the same type of problems; all are eager to exchange Hadoop best practices.

In the next installment, I’ll give an update on the results of our monitoring pilots, describe our progress on surveillance algorithms, and tell you more about our collaboration with other hospitals and clinics.

If you’re considering your own Hadoop implementation, then click here to learn more about Hortonworks Data Platform, and here to understand how it might work for your business with our whitepaper, Hadoop Patterns of Use.

 

Hadoop Market Momentum and You

On 27th March, the Wall Street Journal published an article ‘VCs Bet Big Bucks on Hadoop’ and it seems clear that the market is going to be huge. But what does that mean to you and your personal skills investment? Here’s our view:

Hadoop is HOT

Hadoop is incredibly hot right now as the number of available jobs continues to grow enormously (hey – we even have a bunch of our own right here).

Indeed’s Job Trends shows Hadoop as the 7th hottest skill, and it’s in great company alongside app development skills such as iOS, Android and jQuery. I guess that’s to be expected of course: insights from big data are the fuel for the smartest apps of the future.

The Hadoop trend itself is fairly clear. In growth terms, that is pretty explosive!

Indeed Job Trends

 

A quick search on LinkedIn will pull back around 1200 Hadoop jobs right now (it was 1281 when I checked). And you can also look at the Skills page to see the associated set of component technologies and their relative growth.

Hortonworks is HOT

Apart from the WSJ, just last week, MomentumIndex called out Hortonworks as the 2011 Startup with the most Momentum from a pool of 900 startups being tracked from that year.

We also know when we talk to customers that they’re excited about our approach to pure, community-driven, open source Hadoop. We know developers are excited to get hands on with Hadoop via the Sandbox. And we see from great public responses, like those we saw at Hadoop Summit Amsterdam, that our approach is the right one.

Hadoop, Hortonworks and YOU are HOT

Hortonworks believes in Hadoop and we believe in the power of community-driven open source. We know that this is just the beginning for Hadoop and we back everyone investing their skills in Hadoop, and taking this journey with us. All the way.

Get Started: You can get started by downloading our Sandbox - it’s a VM package containing everything you need to run a single node cluster (I love that expression!) and is packed with tutorials and demos.

Get Connected: Stay in touch. When we say community we mean it – come follow us on Twitter, Facebook and LinkedIn – we want to hear from you as to how we’re doing to provide you with the tools and capabilities to do what your business is demanding. Find a Hadoop User Group (HUG), and come along to the Hadoop Summit.

Get Certified: If you want to differentiate yourself and grab one of those jobs, then you can train and certify with us too. All of the details on that are here.

Dive in and enjoy.

Hadoop Summit 2013 Amsterdam – It’s A Wrap!

We want to take a moment to thank everyone who attended the Hadoop Summit in Amsterdam – THANK YOU! With nearly 500 people registered for the event we think we can safely say it was a big success. We’ve had overwhelming support to do it again next year – so watch this space.

The awesome Beurs Van Berlage venue set us up for a series of fantastic conversations and really well attended sessions and talks as Hadoop continues to explode onto the enterprise scene. Outside of the main tracks, there was great attendance for NLHUG and BoF talks, and kudos to the 10 presenters who ran those lightning talks. Finally, the customer panel was also well received, with great practical advice on adopting Hadoop from HSBC, Neustar and eBay.

But of course it wouldn’t be an event without a party, and we had a great time at the Heineken Experience (from what we can remember).  We put some photos on our Facebook page, but @timoelliott did a much better job than us with this fantastic set on Flickr. This one shows the awesome venue:

hadoop summit exhibition hall

So did you enjoy the summit?  Head over to Facebook  and let us know your favorite part and why: keynotes, tracks, lightning talks, the sandbox experience in the dev cafe, or the party.

And here is a tiny selection of some of the most recent Tweets closing out the show:

Hadoop Summit Tweet

Community voting for Hadoop Summit San Jose is just about complete (you still have a few hours to take part), and we are barely 3 months away from a whole bunch of new content and connections. We hope you join us there too!

Thanks again!

Week in Review: From Plastics to Windows

We’re wrapping up another busy week at Hortonworks towers. I say another, but actually this is my first week. So… it’s a hello from me, I’m Marc Holmes, Community Director. What have we been talking about this week?

Plastics and Hadoop: discuss! We started the week with a post from our VP of Products, Bob Page, drawing an analogy between the growth of the plastics industry and the disruption to the database market driven by Hadoop, looking at the connections and differences to SQL and pointing out ‘what we don’t know yet’ on the evolution of use cases for Hadoop.

Hadoop and Windows sitting in a tree… Arun and Suresh highlighted the joint effort between Hortonworks and Microsoft to make Apache Hadoop run natively on Windows, and celebrated the community vote to move this work into the mainline trunk. We’re community-driven open source folk and we’re delighted not only by the code, but the spirit of community contribution throughout. Microsoft talked about this work over on their Port 25 blog.

Out there. Meantime, there was a LOT of discussion on a couple of articles including this one - Proprietary Hadoop is a Losing Strategy - and this one - One Hadoop Distribution To Rule Them All as a follow up. We believe, and Arun points out, that ‘ultimately the winners in Hadoop will be those investing most heavily in its success’.

But what do you think at a personal level? Do you want Hadoop skills, or Hadoop-a-like skills? Let us know.

And finally, talking of skills, Russell Jurney explained how to Install Hadoop on Windows. So now you know.

Next week… should be quiet. Only the Hadoop Summit in Amsterdam, and a bunch of exciting stuff we’ll tell you more about then. Stay out of trouble and enjoy the show!

Plastics, SQL and the Extensible Future of Hadoop

Mr. McGuire: I just want to say one word to you. Just one word.

Benjamin: Yes, sir.


Mr. McGuire: Are you listening?

Benjamin: Yes, I am.

Mr. McGuire: Plastics.

 

The advice given by Mr. McGuire in 1967’s The Graduate was certainly prophetic — plastics has become one of the largest manufacturing industries in the U.S. (Today, Mr. McGuire would probably say “Data.” But this post isn’t about career choices.)

Plastics initially found itself taking on familiar roles, providing rough equivalents for materials that were more expensive, in low supply, or otherwise limited in ways that made plastics a viable alternative. Materials like glass, wood and metal were commonly imitated. But plastics were often seen as a poor replacement. Eventually, two things happened: new uses were found that went far beyond existing use cases, and the technology got better at resembling the materials it mimicked.

I think history is repeating itself, this time with Hadoop.

First though: Analyzing all that Hadoop data via native MapReduce doesn’t leverage existing SQL skills and technologies, which represent a significant investment. Because Apache Hive, the most widely used technology that brings SQL to Hadoop, is not a complete implementation of SQL, nor designed for interactive queries, we’re seeing a bevy of announcements, both open source and proprietary, that allow SQL-on-Hadoop to meet those use cases. I will avoid enumerating them — more may have appeared since you started reading this.

These SQL-on-Hadoop efforts are like the early days of plastics. Making Hadoop mimic the characteristics of a relational database query language is important and worth investing in. Some will be discarded as poor imitations — especially for customers that are used to enterprise-class warehouse SQL engines like Teradata. Others will get better, and even implement Hadoop-specific innovations, moving SQL forward. Even for the really good technologies, users are still stuck with a thirty-plus-year-old framework and relational model, regardless of how many UDFs and “calls to Hadoop” functions exist – otherwise many BI tools will need to be modified for each of these one-off implementations. Not to mention the operational overhead of new storage layers, resource management, etc.

To be clear: SQL is incredibly important, and will be for a long time. Making SQL-on-Hadoop is a very high priority across the industry. Including Hortonworks — witness the Stinger Initiative. It just doesn’t demonstrate the firepower of this fully armed and operational battle station. Like plastics, Hadoop is a breakthrough technology platform, and creating innovation is where customers will ultimately get the real value.

Unfortunately, in the rush to meet market demand, many of the SQL-on-Hadoop efforts ignore Hadoop’s emerging architecture. While YARN generalizes the Hadoop resource management framework, Apache Tez generalizes the data processing framework, in order to support an amazing array of future applications. YARN+Tez represents the future of Hadoop, and the future of enterprise data.

I recently met with a customer who had an interesting observation: “I don’t understand why [vendor] would implement SQL outside of YARN and Tez. All the extra resource management, operational cost, all the additional work involved — It is like they don’t really understand where Hadoop is going.” The obvious answer is that nobody builds SQL on YARN and Tez, because YARN and Tez aren’t available today. But that’s a short-term answer. YARN and Tez have wide community support and represent a large investment across the community. The community also continues to invest heavily in advancing Apache Hive. By letting Hive use the Tez speed innovations and freeing it from MapReduce, it’ll get the faster execution analysts need, within the Apache Hadoop framework. If this customer uses a non-Hive solution today, how will that solution compare with Hive on Tez?

Objectively comparing future products to future products is a fool’s errand. Technology moves forward, and the solutions will get better. The issue is really one of technology philosophy and approach. Enterprise customers don’t simply make decisions based on the bits that exist today, and will change tomorrow. They want to make sure they are investing in a future.

Which brings me back to the larger issue.

If yesterday’s “early adopter” Hadoop use case was ETL, and today’s “early majority” is SQL, tomorrow may be streaming, or iterative programming, or machine learning, or something we haven’t thought of yet — but it should all work within the data framework we call Hadoop. That is why at Hortonworks, we’re putting our energy into improving Hadoop, rather than coding around it, or adding proprietary extensions to it. The community investments in HDFS and YARN, and generalizing the fundamental building blocks of Hadoop, will allow us to both create a new data ecosystem that makes Apache Hive a first-class SQL engine and enable a new wave of innovation in integrated data management and analytics. That’s a huge opportunity for the industry, and I’m excited to see what comes next.

Getting Ready for The Elephant Party in Europe

We are just under two weeks away from the start of the first ever Hadoop Summit Europe, and with all of the final preparations being made we thought we would highlight some of the not-to-be-missed activities in and around the event. The event is filling fast but you can still register here.

Here are 10 great reasons to attend!

1)   Great track content – there are 35 informative sessions on Apache Hadoop and related technologies for you to choose from selected by the community and delivered by the experts themselves.

2)   Great keynotes – leading industry analyst Matt Aslett will present the opening keynote and we will also hear from open source veteran Shaun Connolly as well as Hortonworks CTO Eric Baldeschwieler.

3)   Hadoop in the Enterprise expert panel – We will have a live panel discussion with industry leaders including eBay, HSBC and Neustar discussing how and why they use Apache Hadoop.

4)   Meetups – the NLHUG and other communities will be meeting around the event.

5)   Lightning talks – we’ve got rapid-fire content coming to you in the form of community-selected lightning talks. These 5-minute sessions will give you a taste of a wide range of technologies and initiatives.

6)   It’s Amsterdam – historic, edgy and fun!

7)   Ecosystem – The conference has the support of the broader Hadoop ecosystem so you can come and discuss Hadoop and big data in the solutions showcase.

8)   Community – The Apache Hadoop community is big and getting bigger. Come meet and mingle with other community members to learn about the latest goings on and make new connections.

9)   Get Hadoop certified – Calling all Hadoop Experts! We’re bringing certification to you! If you are ready, you can take the exam to become a Hortonworks Certified Apache Hadoop Developer (HCAHD) or a Hortonworks Certified Apache Hadoop Administrator (HCAHA).

10)   Get trained on Hadoop – we’ve got a host of classes available during the event to help you learn or sharpen your Hadoop skills. This includes a newly added Applying Data Science class. Check out the classes.

11)  BONUS reason – have a beer on us at the Hadoop Summit Party at the Heineken Experience, a cool venue at a historic location.

Register now and don’t miss the party. Hope to see you there!

Putting the Elephant in the Window

 

For several years now Apache Hadoop has been fueling the fast-growing big data market and has become the de facto platform for Big Data deployments and the technology foundation for an explosion of new analytic applications. Many organizations turn to Hadoop to help tame the vast amounts of new data they are collecting, but to do so they have had to use servers running the Linux operating system. That left a large number of organizations that standardize on Windows (according to IDC, Windows Server owned 73 percent of the market in 2012 – IDC, Worldwide and Regional Server 2012–2016 Forecast, Doc # 234339, May 2012) without the ability to run Hadoop natively, until today.

We are very pleased to announce the availability of Hortonworks Data Platform for Windows, providing organizations with an enterprise-grade, production-tested platform for big data deployments on Windows. HDP is the first and only Hadoop-based platform available on both Windows and Linux and provides interoperability across Windows, Linux and Windows Azure. With this release we are enabling a massive expansion of the Hadoop ecosystem: new participants in the community of developers, data scientists, data management professionals and Hadoop fans can now build and run applications for Apache Hadoop natively on Windows. This is great news for Windows-focused enterprises, service providers, software vendors and developers, who can get going today with Hadoop simply by visiting our download page.

This release would not be possible without a strong partnership and close collaboration with Microsoft. Through the process of creating this release, we have remained true to our approach of community-driven enterprise Apache Hadoop by collecting enterprise requirements, developing them in open source and applying enterprise rigor to produce a 100-percent open source enterprise-grade Hadoop platform.

One of our goals at Hortonworks is to make Hadoop an enterprise-viable data platform, available on as many platforms as possible. In fact, it is already available today in a range of deployment options, including on-premise, virtual, cloud and appliance. Organizations looking to leverage Apache Hadoop now have even more deployment choices between Linux and Windows, giving them more freedom to meet their internal policies and standards. Microsoft Windows customers have complete portability of their Apache Hadoop applications between on-premise and cloud deployments, as Hortonworks Data Platform for Windows and HDInsight Service on Windows Azure are built on exactly the same code line.

If you are in the SF Bay Area this week, you can talk to us live about the power of the Hortonworks Data Platform for Windows at booth #316 at the Strata Conference, taking place February 26-28 at the Santa Clara Convention Center in Santa Clara, Calif.

We will also be conducting the webinar, “Unlocking the Other Half: Introduction to Hortonworks Data Platform for Windows,” on Tuesday, March 12 at 10 a.m. PST / 1 p.m. EST.

To register for the webinar, please visit http://info.hortonworks.com/Hortonworks_HDPonWindows_webcast.html.

 

Buzz Growing for Hadoop Summit Europe

We are now less than a month away from the kickoff of Hadoop Summit Europe, taking place March 20-21 in Amsterdam. The excitement from the community is really starting to grow and pass sales are far ahead of where we expected. Much of the buzz is tied directly to the content that will be presented during the conference.

In all, there will be 42 breakout sessions with presenters from more than 20 companies, including representatives from Adobe, eBay, Facebook, HSBC, LinkedIn, Twitter and Yahoo!. We have started to feature interviews with some of the most compelling speakers on the Hadoop Summit website. Those posted thus far include:

  • Clemens Neudecker of the National Library of the Netherlands and Sven Schlarb of the Austrian National Library (interview)
  • Alasdair Anderson of HSBC (interview)
  • Mikhail Petrenko of Adobe (interview)
  • Jason Dai of Intel (interview)
  • Steve Watt of Red Hat (interview)
  • Joydeep Sen Sarma of Qubole (interview)

The breakout sessions are broken down into four tracks, each aimed at providing valuable and educational content to meet the varied needs of the attendees. We recently featured interviews with each of the track chairs in order to provide some insight into the track sessions and the expected takeaways from each. The interviews are available on the Hadoop Summit website and also linked to below:

  • Evert Lammerts, Track Chair, Operating Hadoop (interview)
  • Isabel Drost, Track Chair, Applied Hadoop (interview)
  • Lars George, Track Chair, Integrating Hadoop (interview)
  • Steve Loughran, Track Chair, Hadoop Futures (interview)

We also recently announced the initial set of speakers for the Lightning Round, which will take place during the first evening of the conference. Speakers will have 5 minutes to cover the topics that the community voted as the ones they wanted to learn about during Hadoop Summit.

The list of the initial 8 Lightning Round sessions is available here.

You definitely don’t want to miss this powerful and exciting lineup of speakers, so REGISTER for Hadoop Summit Europe today!!

Hadoop Summit Europe Call for Papers Ends this Friday, November 30th

The Hadoop Summit Europe official call for papers ends this Friday, November 30th – so be sure to get your session submissions in this week!

Hadoop Summit Europe is March 20-21 at the Beurs van Berlage in Amsterdam, Netherlands. You still have time to submit an abstract now!

The four content tracks are:

Applied Hadoop

Sessions in this track focus on applications, tools, algorithms and data science as well as areas of advanced research and emerging applications that use and extend the Hadoop platform. Sessions will cover examples of innovative data processing applications and algorithms for performing the most common statistical analysis as well as supporting the latest advances in artificial intelligence and machine learning.

Operating Hadoop

This track focuses on the deployment and operations of Hadoop clusters with an emphasis on tips, tricks, and best practices. Sessions will cover the full deployment lifecycle from installation, configuration, and initial production deployment to large-scale roll out. Reference architectures that maximize performance while minimizing costs will also be covered.

Hadoop Futures

This track takes a technical look at the key open source projects and research efforts driving innovation in and around the Hadoop platform. Attendees will hear from the technical leads, committers, and expert users who are actively driving the roadmaps, key features, and advanced technology research.

Integrating Hadoop

For many, Hadoop success will largely depend on the ability to integrate with existing data-driven and data management technologies. Whether it is streaming, batch or real-time interaction, these integration points are what expose the value of Hadoop to the rest of the enterprise. This track focuses on Hadoop + enterprise (in particular databases, data warehouses, NoSQL, etc.). Sessions will explore these key integration points and will provide deployment and production examples of successful Hadoop integration within the enterprise today.

Announcing Chairs for Hadoop Summit Europe

Track Chairs have been named for Hadoop Summit Europe. Track Chairs will, in turn, select their track committees who, as a team, will decide which sessions are to be presented at Hadoop Summit Europe. They are as follows:

Operating Hadoop – Evert Lammerts, SARA

I joined Sara as a technical consultant in October 2008. In 2009 I started experimenting with non-traditional distributed processing and storage platforms, mainly Hadoop. I’m currently the lead for Hadoop and related big data services. I’m also the organizer of the Dutch Hadoop User Group meetup, jury member for the 2012-2013 Norvig Web Data Science Award, and chair of the Operating Hadoop track of the first European Hadoop Summit.

Before joining Sara I lived in Hungary for four years, where I finished my studies in Software Engineering at MTA SZTAKI. During those years I also worked as a short-term expert for the Dutch ministry of Agriculture, Nature Management, and Fisheries, in a Twinning project with our counterpart in Serbia. Right now I’m back in The Netherlands.

Applied Hadoop – Isabel Drost-Fromm, Mahout

Isabel Drost is a member of the Apache Software Foundation. She is the founder of the Berlin Buzzwords Conference and of the Apache Hadoop Get Together in Berlin, and co-organiser of the first European NoSQL meetup. Isabel co-founded Apache Mahout and is an active Apache Mahout committer. Isabel is actively engaged with communities of various Apache projects, e.g. Lucene and Hadoop. She is a regular speaker at renowned conferences on topics related to free software development, scalability, big data, Hadoop and Mahout. Currently Isabel Drost works for Nokia Gate 5 GmbH as a Software Developer.

Integrating Hadoop – Lars George, Cloudera

Lars George has been involved with HBase since 2007, and became an HBase committer in 2009. He has spoken at many Hadoop User Group meetings, and conferences such as FOSDEM, QCon, and Hadoop World. He also started the Munich OpenHUG meetings. He now works for Cloudera to support Hadoop and HBase in and around Europe through technical support, consulting work, and training. He is also the author of O’Reilly’s “HBase – The Definitive Guide”.

Hadoop Futures – Steve Loughran, Hortonworks, Hadoop

Steve Loughran is a member of technical staff at Hortonworks, where he works on leading-edge developments within the Hadoop ecosystem, including service availability, cloud infrastructure integration, and emerging layers in the Hadoop stack.

Previously, he worked at HP Laboratories on large-scale distributed systems, including cloud computing infrastructures, dynamic Hadoop clusters and configuration management.

He is the author of Ant in Action, a member of the Apache Software Foundation, an active committer on the Hadoop core projects, and an inactive committer on Apache Ant and Axis.

He lives and works in Bristol, England.

Call for Abstracts Is Now Open!

The call for abstracts is now open. To submit an abstract, go here: http://hadoopsummit.org/amsterdam/call-for-papers/. The deadline for submission is November 30th, so hurry now!

ApacheCon EU Day One Roundup – Part 1

Hackathon and Aeromuseum Reception

ApacheCon Europe kicked off yesterday with an all-day Hackathon followed by a committers’ reception at the Sinsheim Technik Museum, which has, among other large aircraft, a Concorde in Air France livery. My favorite was the diesel engine from a U-Boat – and its enormous drive-shaft and pistons.

Taking the Guesswork out of Hadoop Infrastructure

Winding a rented Opal through its gears along village roads for half an hour from my hotel-out-of-a-black-forest-fairy-tale, I made it to ApacheCon EU’s first day of sessions mid-way through the first talk by Steve Watt, ‘Taking the Guesswork out of Hadoop Infrastructure.’ Steve talked about the harsh reality of fitting hardware to a given workload using Hadoop with the quote: “We’ve profiled our Hadoop applications so we know what type of infrastructure we need.” — Said No One, Ever. Steve covered ways to instrument your cluster and outlined practical ways to test and tune your Hadoop and HBase clusters.

He also discussed ‘System on a Chip and Hadoop,’ which brings to mind the recent debate about Hadoop-specific hardware solutions.

Discussions in the hallways centered around long-term trends and shifting economics around cluster computing. With the PC rapidly being replaced by mobile devices and tablets, will the economies of scale for large clusters of PCs change? Will the growth of cloud-computing replace the desktop PC and continue to drive economy of scale? Or, will custom solutions start to make headway over commodity hardware over the next five years as the desktop and notebook PC disappear, driving up the cost of PC-based servers and making custom hardware more competitive? Will the economies of scale and power-efficiency of mobile and tablet chips replace the PC processor in Hadoop clusters? Fun stuff to contemplate!



Chart from MobileRodie.

The chart below would indicate that PC nodes will remain competitive, but that mobile-derived hardware may get cheap enough to compete as well! Or perhaps I’m dreaming :)

Enabling Elastic, Multi-tenant, Highly Available Hadoop on Demand

Next up was Richard McDougall with Enabling Elastic, Multi-tenant, Highly Available Hadoop on Demand which covered the ins and outs of Hadoop with virtualization. We’ve talked previously on the Hortonworks blog about virtualization as a part of Hadoop NameNode HA on Hadoop 1.

Virtualizing Hadoop data nodes on Amazon EC2 or VMware has posed a major tradeoff in performance in the past, and VMware is hard at work getting that penalty down to 10% for VMware virtualized Hadoop clusters. Project Serengeti was founded with this goal in mind.

Extending lifespan with Hadoop and R

Radek Maciaszek presented Extending lifespan with Hadoop and R, which covered his project to identify aging-related genes using R and Hadoop at the UCL Institute of Healthy Aging.

Inside Hadoop Development

Hortonworks’ own Steve Loughran presented Inside Hadoop Development.

That’s it for now. I’ll summarize the rest of the day up next!

DINOSAURS ARE REAL: Microsoft WOWs audience with HDInsight at Strata NYC (Hortonworks Inside)

You don’t see many demos like the one given by Shawn Bice (Microsoft) today in the Regent Parlor of the New York Hilton, at Strata NYC. “Drive Smarter Decisions with Microsoft Big Data,” was different.

For starters – everything worked like clockwork. Live demos of new products are notorious for failing on-stage, even if they work in production. And although Microsoft was presenting about a Java-based platform at a largely open-source event… it was standing room only, with the crowd overflowing out the doors.

Shawn demonstrated working with Apache Hadoop from Excel, through Power Pivot, to Hive (with sampling-driven early results!?) and out to import third-party data sets. To get the full effect of what he did, you’re going to have to view a screencast or try it out, but to give you an idea of what the first proper interface on Hadoop feels like…

There was a comedian who had a bit about… remember when you saw Jurassic Park for the first time? No matter how old you were, your child-like response was, “DINOSAURS ARE REAL!!!!!!$!!$##!” Our reaction to Jurassic Park was CGI technology disrupting cinema, provoking the same kind of reaction early cinema had on viewers who felt real concern that the horse or train approaching would run them over. At least that’s what I learned wasting a lottery-funded academic scholarship on film classes at a state university before having the good sense to fail out and use my time productively.

That feeling you got when you saw your first CGI raptor is what Microsoft’s demo was like, except it went… “HADOOP IS IN EXCEL!!$%!%!%!$????!!!”

This is a serious thing for me, because I hooked up Pig and Excel years ago:

Which is a crappy demo of Hadoop connecting to Excel, but which gives me mucho moral authority to state that Microsoft’s demo was the right way to hook data to Excel. Take it from someone that spent half of his twenties trying to build web applications that could compete against Excel: until data is in Excel… it ain’t real. With Microsoft’s new offering… big data just got real.

To put this into perspective:

And just so you know I’m not bullshitting you about Hadoop and Big Data and Raptors and next thing you know you’re checking for your wallet and nodding awkwardly and trying to find a pause in this lunatic rant to get the hell out of here, I’ll just come out and tell you:

I have a raptor named lame-o-saurus in a Cowboy Curtis hat permanently tattooed on my body. Again, we resort to visualization (mind the hair):

To summarize:

  1. I am the world’s primary authority on the wrong way to hook Hadoop to Excel.
  2. I have strange tattoos which affirm the validity of my metaphors.
  3. Microsoft has fundamentally altered Big Data with their HDInsight offering.
  4. Yesterday, a breakthrough happened in the Regent Parlor of the Hilton, NYC.

Visicalc… we’ve come such a long way.

Strata NYC Reporting: Monday @ Big Data Camp, Tuesday @ Strata Retrospective

This is Russell Jurney, your Big Data reporter on the ground here at Strata NYC/Hadoop World at the New York Hilton. Monday night’s main event was Big Data Camp. As in any unconference, the best action was in the hallway, meeting people you only know by reputation or from Twitter. Highlights were:

  • Microsoft’s demonstration of Excel -> Power Pivot -> Hortonworks Data Platform
      • In light of today’s announcement, the Hadoop market just got MUCH bigger :)

  • Druid: Real-Time Analytics at a Billion Rows Per Second by Eric Tschetter, Co-founder of Metamarkets
      • In-RAM stores are an interesting new development as RAM becomes cheaper and cheaper, and they can augment a Hadoop-centric workload.

  • The Horrors Hidden in Your Models by Steven Hillion
      • This talk stressed the importance of unit testing your statistical models and finding areas where they fall over, then working with customers to understand the problem. A humorous use case involving a hoax ‘finger-in-chili’ incident was examined.

Tuesday’s tutorial sessions were great. My favorites were:

Check back tomorrow for coverage of Wednesday’s technical sessions!

Enabling Big Data Insight for Millions of Windows Developers

At Hortonworks, we fundamentally believe that, in the not-so-distant future, Apache Hadoop will process over half the world’s data flowing through businesses. We realize this is a BOLD vision that will take a lot of hard work by not only Hortonworks and the open source community, but also software, hardware, and solution vendors focused on the Hadoop ecosystem, as well as end users deploying platforms powered by Hadoop.

If the vision is to be achieved, we need to accelerate the process of enabling the masses to benefit from the power and value of Apache Hadoop in ways where they are virtually oblivious to the fact that Hadoop is under the hood. Doing so will help ensure time and energy is spent on enabling insights to be derived from big data, rather than on the IT infrastructure details required to capture, process, exchange, and manage this multi-structured data.

So how can we accelerate the path to this vision? Simply put, we focus on enabling the largest communities of users interested in deriving value from big data.

Since one of the world’s most widely used business intelligence tools is Microsoft Excel, and since Microsoft is arguably one of the best companies at enabling and mobilizing large and vibrant developer communities, needless to say we at Hortonworks are excited and bullish on the expansion of our partnership with Microsoft.

Today Microsoft unveiled previews of Microsoft HDInsight Server and Windows Azure HDInsight Service, big data solutions that are built on Hortonworks Data Platform (HDP) for Windows Server and Windows Azure respectively. These new offerings aim to provide a simplified and consistent experience across on-premises and cloud deployments that is fully compatible with Apache Hadoop.

This news represents a significant inflection point for the big data market in general and for the importance of open source Apache Hadoop in particular. Unlocking the Windows Server and Windows Azure markets for Hadoop means more businesses will be able to tap into its benefits.

Moreover, these new offerings represent months of joint engineering work across both the Microsoft and Hortonworks engineering and product teams. Microsoft’s commitment to doing this work in a way that improves open source Apache Hadoop and related Apache projects has been unwavering, which translates into goodness for the open source community.

I encourage you to try out the fruits of our labors in one of two ways:

• Download Microsoft HDInsight Server and play with Hadoop on your own Windows machine.
• Access Windows Azure HDInsight Service and play with Hadoop in the cloud.

I encourage you to go to http://hortonworks.com/partners/microsoft/ in order to learn more and get started!

Finally, check out Microsoft’s announcement for more information! http://blogs.technet.com/b/dataplatforminsider/archive/2012/10/22/simplifying-big-data-for-the-enterprise.aspx

Hadoop Summit Expands to Europe in 2013!

This will be the first and the largest European conference focused exclusively on accelerating the enterprise adoption of Apache Hadoop. The event will be a gathering for the vibrant Apache Hadoop community of developers, data scientists, data professionals and solution providers and will be held at the historic Beurs van Berlage in Amsterdam on March 20-21, 2013.

Call for papers now open!

Apache Hadoop practitioners, enthusiasts and solution providers with an idea for a talk at the event can submit their ideas now on the call for papers page. All accepted speakers will receive complimentary admission to the event.

For more information on Hadoop Summit Europe, go to: http://hadoopsummit.org/amsterdam.

Remember to follow us on Twitter and Facebook for future updates!

We hope to see you there!
