Category Archives: Hortonworks Topics


Week in Review: OpenStack, Data Science and Ambari

Almost time to spend a relaxing weekend in the garden, or crushing some data in your garage-based homebrew Hadoop cluster – whichever you prefer. But before we choose our path, let’s take a look at the last two weeks of happenings (I was lost in Oregon last week).

Hadoop is the perfect app for OpenStack. While I was struggling with driving directions, Red Hat, Mirantis and Hortonworks were announcing plans for Project Savanna, which aims to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds. Jim also wrote up some comprehensive notes from the awesome OpenStack Summit event.

Need Data Science? Here’s how to build a team. Ofer followed up his post on 4 Reasons to use Hadoop for Data Science with some thinking on the continuum of skills and roles that represent a data science team. This proved to be something of a hot topic, and was referenced amongst some collective thinking on GigaOM. In a subsequent post, he also dived a little deeper into Data Agility.

 

Managing Hadoop? Some field notes from the first Apache Ambari Meetup. This inaugural meetup at our office was well attended with some great discussion, and we published the presentations and recordings over here.

 

Data Warehouse? Hadoop? When to use Which. In an interview as a backdrop to a Teradata-hosted webinar, Hadoop & the Enterprise Data Warehouse: When to Use Which, Chad Meley, Eric Baldeschwieler and Stephen Brobst talk about their experiences with both. It’s on April 30th, so there’s still time to register.

Considering deploying a Hadoop cluster? OK, so a Hadoop cluster sounds like an awesome idea – but what are the things you should consider in building that infrastructure? This checklist from HP may be useful for your planning.

And finally some stuff to do:

Have a great weekend!

How to Build a Hadoop Data Science Team

Data scientists are in high demand these days. Everyone seems to be hiring a team of data scientists, yet many are still not quite sure what data science is all about, and what skill set they need to look for in a data scientist to build a stellar Hadoop data science team. We at Hortonworks believe data science is an evolving discipline that will continue to grow in demand in the coming years, especially with the growth of Hadoop adoption. This role requires experience and knowledge in math, statistics and machine learning, programming and scripting, as well as visualization techniques.

Hadoop data scientists

We tend to think of the data scientist role as a continuum of skills:

Software engineers really enjoy crafting new production-grade software systems that are testable, maintainable, secure and scalable. Some of those software engineers specialize in working with data. They tend to be highly skilled in technologies like SQL, Hadoop, Hive/Pig and MapReduce, and excel at building production-quality data pipelines. We call those “data engineers”.

Research scientists focus on academic research in machine learning and statistical techniques, creating brand new algorithms like support vector machines and deep learning, and proving theoretical properties of such algorithms. Applied scientists are those research scientists who thrive on solving real world problems with real data. They are very good at applying state-of-the-art algorithms and techniques to real world data.

The data scientist role combines the skill set and experience of a data engineer with that of the applied scientist. It is quite difficult to find good data scientists, because the combination of all these skills and interests is rarely found in a single person.

“Okay, okay, I understand it’s hard to find good data scientists”, you may say, “but I still need to complete my data projects, what should I do?” One option might be to train data engineers to be experts in math, statistics and applied science. Or maybe hire applied scientists and train them to be good software engineers. In my experience that approach has limited success, because good software engineers may not be as good in applied science, or may not be interested in shifting their career in that direction. And vice versa.

Instead, simply build a Hadoop data science team that combines data engineers and applied scientists, working in tandem to build your data products. Back when I was at Yahoo!, that’s exactly the structure we had:  applied scientists working together with data engineers to build large-scale computational advertising systems.

 

 

Week in Review: Patterns, Glue and Moonshots

It’s the end of another action-packed week, so before we all head off for the weekend, let’s recap the conversations from this week – we hope you’re enjoying them.

We’re delighted by the response to our Hadoop Patterns of Use whitepaper and presentation - that really seems to have struck a chord with everyone thinking about what Hadoop can really do for their business. You can see that content just below here – an excellent read for the journey home.


Also popular were the slides from one of our resident data scientists, Ofer Mendelevitch, who had 4 great reasons to use Hadoop for data science. He’ll be mining for more right now. Another article we liked from Stratconf explained the importance of imagination in data science.

 

Mid-week, we turned our attention to the awesomeness of HCatalog and spent a little time geeking out on the capabilities it provides as the glue for all your data. We also got a little bit excited about the HP Moonshot announcement - we love the idea of an appliance that can enable 1800 nodes in a single chassis. Wow.

But wait there’s more… Justin published the 2nd in a series of guest posts from Charles Boicey on a real-life implementation of Hadoop to improve patient monitoring in healthcare. And sneaking in at the end of last week we looked at the reality of integrating SAP and Hortonworks Data Platform.

And technically, we saw some interesting articles:

Enough to keep you going until next week? OK, one more then… Cheryle offered some great advice on things you can do in the Sandbox to boost your skills. Go on, get stuck in.

UC Irvine Health: Improving Quality of Care with Apache Hadoop (Part 2)

This is the second part of a series written by Charles Boicey from UC Irvine Health (part 1 is here). The series will demonstrate a real case study for Apache Hadoop in healthcare and also journal the architecture and technical considerations presented during implementation.


It has been 232 days since the last post. Much has transpired including a rebranding of the organization from UCI Medical Center to UC Irvine Health. I am happy to report we have a production Saritor environment up and running on the Hortonworks Data Platform.

Here are some highlights from the past 232 days:

Home Monitoring

In collaboration with our medical device integration partner, iSirona, we are developing a system to acquire home monitoring data and transmit it to Saritor. Our first deployed device will be a scale. This may sound simple, but in-home monitoring of the daily weights of Congestive Heart Failure patients is essential for preventing those patients from being readmitted to the hospital.

Home monitoring data will not be transmitted directly to the Electronic Medical Record (EMR), for a very specific reason. Home device data from thousands of patients transmitted directly to the EMR would be a nightmare for clinicians to manage. It would be too much data. By sending the data to Saritor first, an algorithm can determine which changes in weight indicate a risk of readmission and then notify clinicians about those cases. All home monitoring data will be viewable in the EMR via an API to Saritor.
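
The triage step described above can be sketched as a simple rule over daily weights. This is only an illustration, not Saritor’s actual algorithm: the function name `flag_weight_gain` and both thresholds are assumptions made up for the example and are not clinical guidance.

```python
from datetime import date, timedelta

# Hypothetical thresholds, chosen for illustration only (not clinical guidance).
DAILY_GAIN_KG = 1.0
WEEKLY_GAIN_KG = 2.0


def flag_weight_gain(readings):
    """Return the dates on which a patient's weight gain crosses a threshold.

    `readings` is a list of (date, weight_kg) tuples for one patient.
    """
    readings = sorted(readings)
    flagged = []
    for i, (day, weight) in enumerate(readings):
        # Earlier readings taken within the previous 7 days.
        window = [(d, w) for d, w in readings[:i] if (day - d).days <= 7]
        if not window:
            continue
        prev_day, prev_weight = window[-1]
        # Only count a daily gain when the previous reading was yesterday.
        daily_gain = weight - prev_weight if (day - prev_day).days == 1 else 0.0
        weekly_gain = weight - min(w for _, w in window)
        if daily_gain >= DAILY_GAIN_KG or weekly_gain >= WEEKLY_GAIN_KG:
            flagged.append(day)
    return flagged
```

Only the flagged dates would need to reach a clinician; the full series of readings stays in the data store.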

In-Hospital Monitoring

We are working on a pilot to enhance patient monitoring in the hospital. In California, nurses typically have up to five patients to care for, and it can be challenging to be with a patient at the bedside and also keep a close eye on all the small changes in vitals across all patients.

Soon, hospitals will be able to provide each new inpatient with a wearable disposable patch that monitors vital signs such as heart rate, temperature and pulse oximetry, and wirelessly transmits that data every minute to Saritor. An algorithm can “watch” that data for patterns that the nursing team might not be able to catch. Because nurses cannot watch a monitor for every minute of their shift, Saritor has “got their back”. Nurses can go about the business of caring for patients and Saritor will notify them when there is a disturbing pattern in a patient’s vitals. A data warehouse might be able to run a similar algorithm, but with 24-hour latency. That’s too much latency for a nurse to respond quickly to an emergent situation.
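
To make the idea of “watching” minute-by-minute data concrete, here is a minimal sketch of a sliding-window check. It is an assumption-laden toy, not the pilot’s algorithm: the class name `VitalsWatcher`, the five-minute window and the heart-rate limit are all hypothetical.

```python
from collections import deque

WINDOW_MINUTES = 5        # hypothetical: number of consecutive readings inspected
HEART_RATE_LIMIT = 120    # hypothetical: beats per minute, illustration only


class VitalsWatcher:
    """Tracks the most recent per-minute heart-rate readings for one patient."""

    def __init__(self, window=WINDOW_MINUTES, limit=HEART_RATE_LIMIT):
        # A bounded deque drops the oldest reading automatically.
        self.readings = deque(maxlen=window)
        self.limit = limit

    def add_reading(self, heart_rate):
        """Record one reading; return True if the whole window is above the limit."""
        self.readings.append(heart_rate)
        window_full = len(self.readings) == self.readings.maxlen
        return window_full and all(r > self.limit for r in self.readings)
```

Requiring the whole window to be above the limit means a single noisy reading does not page a nurse, while a sustained change is reported within minutes rather than the next day.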

Patient Self-Monitoring

With the increasing numbers of patients joining the “Quantified Self” movement, we see Saritor as the ideal environment to help receive more health data generated by the patients themselves. We want to store and make use of patient-generated data from personal health records and home monitoring. Sites such as Fitbit, 23andMe and others could also feed in data. With open APIs to a patient’s personal health record, this data can be ingested into Saritor and then be made available to clinicians via the EMR. Score cards from the EMR data in Saritor can also be pushed back out to the patients.

Other Lessons We’ve Learned

Hadoop Plays Well with Others

One awesome discovery we made was that the Hadoop ecosystem plays well with other systems. We were able to start ingesting data into Hadoop without having to change anything within the current IT environment. For example, all of the healthcare data ingested into Saritor goes into HDFS. For the monitoring of inpatients, MapReduce jobs run against HDFS and then push that data into MongoDB. Algorithms in Mahout run against the data in MongoDB and can push notifications to the EMR via an event engine.
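
The map and reduce steps in that flow can be sketched in a few lines. This is a local, single-process stand-in written for illustration: the tab-separated record layout (patient id, timestamp, heart rate) and the function names are assumptions, and the real jobs would run on the cluster against HDFS.

```python
from collections import defaultdict


def map_vitals(line):
    """Map step: parse one tab-separated record into (patient_id, heart_rate)."""
    patient_id, _timestamp, heart_rate = line.rstrip("\n").split("\t")
    yield patient_id, int(heart_rate)


def reduce_vitals(patient_id, heart_rates):
    """Reduce step: summarize all readings seen for one patient."""
    rates = list(heart_rates)
    return {
        "patient_id": patient_id,
        "readings": len(rates),
        "max_heart_rate": max(rates),
        "mean_heart_rate": sum(rates) / len(rates),
    }


def run_job(lines):
    """Local stand-in for the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_vitals(line):
            grouped[key].append(value)
    return [reduce_vitals(key, values) for key, values in sorted(grouped.items())]
```

Each dictionary returned by `run_job` has the shape of a document that could be written to a document store such as MongoDB; the write itself is left out of the sketch.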

For graph analysis of healthcare data, MapReduce jobs run against HDFS and then output the results in graph form for input into Neo4j.
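
As a rough picture of what “output in graph form” can mean, the sketch below turns flat records into a weighted edge list. Everything specific here is hypothetical: the (patient id, diagnosis) record shape, the choice of co-occurrence edges, and the CSV column names are assumptions for the example, and loading the file into Neo4j is outside its scope.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations


def co_occurrence_edges(records):
    """Build weighted edges between diagnoses that share a patient.

    `records` is an iterable of (patient_id, diagnosis) pairs.
    """
    by_patient = defaultdict(set)
    for patient_id, diagnosis in records:
        by_patient[patient_id].add(diagnosis)

    weights = defaultdict(int)
    for diagnoses in by_patient.values():
        # Sorting gives each undirected pair one canonical orientation.
        for a, b in combinations(sorted(diagnoses), 2):
            weights[(a, b)] += 1
    return sorted((a, b, n) for (a, b), n in weights.items())


def edges_to_csv(edges):
    """Serialize edges as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    writer.writerows(edges)
    return buffer.getvalue()
```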

Legacy Healthcare Data Is Valuable

We ended up with 9 million patient records spanning 22 years and 1.2 million patients. Our original estimate was 3 million records. We are using this data to build our surveillance algorithms.

Social Media Is an Important New Source of Information

Saritor is capable of storing social media data related to UC Irvine Health, and a UCI student project is underway to develop a sentiment analysis dashboard to better understand the social media environment external to UC Irvine Health. As part of the patient experience feedback loop, we will be able to reach out and connect with patients to better understand their concerns so that we can enhance the patient experience.

Others in the Healthcare Community Are Interested in Adopting Hadoop

I’ve spoken with many other healthcare providers that are trying to solve the same type of problems, and all are eager to exchange Hadoop best practices.

In the next installment, I’ll give an update on the results of our monitoring pilots, describe our progress on surveillance algorithms, and tell you more about our collaboration with other hospitals and clinics.

If you’re considering your own Hadoop implementation, then click here to learn more about Hortonworks Data Platform, and here to understand how it might work for your business with our whitepaper, Hadoop Patterns of Use.

 

Hortonworks Sandbox: Dreaming Up New Tutorials For You

We’re cooking up some new tutorials for you to play with in your Hortonworks Sandbox to help you learn more about the Hortonworks Data Platform, Apache Hadoop, Hive, Pig and HCatalog, with maybe a smattering of Mahout in there as well.


While you’re anxiously waiting, we thought we’d give you some pointers to some resources so that you can experiment and play. After all, that’s what a Sandbox is all about, right?

Language Manuals

First, if you’re looking to expand your skills, take a look at the Hive Language Manual, the Pig Tutorial on the Apache Foundation website, and the Command Line Interface information on the HCatalog project incubator site.

Use Hive to SQLize

Feeling a bit more advanced? Take a look at Russell Jurney’s blog posts, HOWTO use Hive to SQLize your own Tweets Part 1, and HOWTO use Hive to SQLize your own Tweets Part 2.

Pull In Your Own Data

You have datasets. We know you want to put them in Hadoop. It’s easy. You can import your own data into the Sandbox the same way you imported data sets in the tutorials.

Looking for other interesting data sets? There are many interesting sets for you:

In the meantime, we’re working hard to bring you new and interesting tutorials. We’d love to see what you’ve done. Show us your demos and tutorials — who knows, there might be one of the coveted stuffed elephants in your future!

Where are Hortonworkers? Events and Meetups 8th April to 22nd April

Hortonworkers are out there – here is a rundown of the events and meetups we’ll be at in the next couple of weeks, and we hope we’ll see you there. Did we miss any? Want us to attend your event? Let us know!

Big Data Innovation Summit

April 10-11, 2013, San Francisco, CA

http://theinnovationenterprise.com/summits/big-data-innovation-summit-april-2013-san-francisco

Spring into April and jump into Big Data! Be sure to meet us at Big Data Innovation Summit by the bay. We’re excited to have Alan Gates, co-founder of Hortonworks, present a couple of really exciting talks and we hope you can join us.

  •  April 11 @ 9:30am: Coordinating the Many Tools of Big Data in Hadoop
  •  April 11 @ 12:30pm: Hadoop Now, Next and Beyond
  •  April 11 @ 2:00pm: Roundtable Session: Use Case Patterns: Horizontal or Vertical

As a global sponsor, we’ll also be exhibiting. Look for us in the exhibit area and meet members of the Hortonworks team, who will be happy to discuss any questions you have on Hadoop and Hortonworks.

PASS Business Analytics Conference

April 10-12, 2013, Chicago, IL

http://www.passbaconference.com – booth S5

We’re excited to participate in the first PASS Business Analytics (BA) conference, a community-driven event. We will be speaking at three sessions: “Why Apache Hadoop for Data Science”, “The Future of Apache Hive and Hadoop 2.0”, and “Big Data: Threat or Opportunity?”

Teradata Universe Copenhagen 2013

April 14-17, 2013, Copenhagen, Denmark

http://www.teradataemea.com/

We’re delighted to be a Platinum sponsor at Teradata Universe. The conference gathers experts from internationally recognized companies and presenters from Teradata’s customer community to deliver insights on the new trends driving the industry and on how Big Data analytics is used to drive business value.

Chris Harris, Solutions Engineer at Hortonworks, will be speaking at the Solution Showcase on “Big Data: Making Sense of it all!” on Monday April 15 at 12:40 and Tuesday April 16 at 11:20.

More on the Hortonworks / Teradata partnership can be found at www.hortonworks.com/teradata

eMetrics Summit

April 14-18, 2013, San Francisco

http://www.emetrics.org/sanfrancisco/2013/

Hortonworks VP Products, Bob Page, will be speaking at two sessions at this analytics event.

OpenStack

April 15-18, 2013, Portland, Oregon

http://openstacksummitapril2013.sched.org/

We’re heading to our very first OpenStack Summit to talk about all things Apache Hadoop on OpenStack and we would love to meet you! A cloud deployment model makes perfect sense for Hadoop, which (a) allows for efficient infrastructure usage and (b) is a net new workload for most organizations (awesome…far fewer legacy considerations).  So Hadoop + OpenStack seems like a logical fit.  If your organization is interested in combining these two mega technology trends, it would be great to connect with our team who can share what others are doing!

There are many ways to meet the Hortonworks team!
We’ll be speaking:

And we’re exhibiting! Come by our Hortonworks booth, say hello, geek out on Hadoop and Big Data and pick up some awesome swag while you’re at it!

Charlotte Hadoop Users Group, 11th April 2013

http://www.meetup.com/CharlotteHUG/

Terry Padgett will present on the Stinger Initiative, Tez and Knox.

Bay Area HUG, 17th April 2013

http://www.meetup.com/hadoop/events/63737062/

Owen O’Malley will present on the Stinger Initiative.

Chicago HUG, 22nd April 2013

http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/events/106391622/

George Vetticaden will present on the Stinger Initiative, Tez and Knox.

Week in Review: Falcon, Hadoop Momentum and BFFs Forever!

More of a 2 weeks in review this time around owing to the Easter break. So what’s been happening?

Falcon bringing Data Lifecycle Management for Hadoop. The big news this week was the newly approved Apache Software Foundation incubator project – Falcon. The project was initiated by the team at InMobi and engineers from Hortonworks towers with the intent of simplifying data management through a data lifecycle management framework. Something for everyone then. More on Falcon here. Once again, it’s a great example of community driven open source driving the innovation that matters, or as Mohit Saxena of InMobi said:

[Image: quote from Mohit Saxena of InMobi]

Want to be BFFs with Hortonworks? According to this article on TechWorld, everyone does, and Neustar details why. We’re flattered by the sentiment and we’d love to be your friend. You can ‘Like’ us over here.

Market Momentum. So, with all of the innovation and buzz around Hadoop and Hortonworks, what does that mean for you, me, or anyone looking to dip a toe in the water? This post highlighted the market momentum and the surrounding skills and jobs and how you can get involved. I recommend you start by grabbing a copy of the Sandbox and take advantage of this graph…

Hadoop Summit Keynotes and Sessions. As memories of Amsterdam glow in the mind, the content from the event began to flow, and you can now view the videos and slides of keynotes and sessions on the summit site. We also announced the selectees of the community choice section of Hadoop Summit North America in San Jose, and the panels are now hard at work selecting the remaining sessions. You have registered, haven’t you?

 

And finally, can you define Big Data? I guess that depends on your individual perspective. In this short piece, Russell describes Big Data in terms of transformative economics. Something to chew on until next week.

Have a great weekend!

Hadoop Market Momentum and You

On 27th March, the Wall Street Journal published an article ‘VCs Bet Big Bucks on Hadoop’ and it seems clear that the market is going to be huge. But what does that mean to you and your personal skills investment? Here’s our view:

Hadoop is HOT

Hadoop is incredibly hot right now as the number of available jobs continues to grow enormously (hey – we even have a bunch of our own right here).

Indeed’s Job Trends shows Hadoop as the 7th hottest skill, and it’s in great company alongside app development skills such as iOS, Android and jQuery. I guess that’s to be expected of course: insights from big data are the fuel for the smartest apps of the future.

The Hadoop trend itself is fairly clear. In growth terms, that is pretty explosive!

Indeed Job Trends

 

A quick search on LinkedIn will pull back around 1200 Hadoop jobs right now (it was 1281 when I checked). And you can also look at the Skills page to see the associated set of component technologies and their relative growth.

Hortonworks is HOT

Apart from the WSJ, just last week, MomentumIndex called out Hortonworks as the 2011 Startup with the most Momentum from a pool of 900 startups being tracked from that year.

We also know when we talk to customers that they’re excited about our approach to pure, community-driven, open source Hadoop. We know developers are excited to get hands on with Hadoop via the Sandbox. And we saw from great public responses, like those at Hadoop Summit Amsterdam, that our approach is the right one.

Hadoop, Hortonworks and YOU are HOT

Hortonworks believes in Hadoop and we believe in the power of community-driven open source. We know that this is just the beginning for Hadoop and we back everyone investing their skills in Hadoop, and taking this journey with us. All the way.

Get Started: You can get started by downloading our Sandbox - it’s a VM package containing everything you need to run a single node cluster (I love that expression!) and is packed with tutorials and demos.

Get Connected: Stay in touch. When we say community we mean it – come follow us on Twitter, Facebook and LinkedIn. We want to hear from you on how we’re doing at providing you with the tools and capabilities to do what your business is demanding. Find a Hadoop User Group (HUG), and come along to the Hadoop Summit.

Get Certified: If you want to differentiate yourself and grab one of those jobs, then you can train and certify with us too. All of the details on that are here.

Dive in and enjoy.

Week in Review: Sandboxes, HDP 2.0 Alpha 2, Hive Performance and Summits

It’s almost time for that final drive home of the week, and what a week it has been with a few new releases, a summit, and a little bit of technical fun. Here’s what happened:

New Sandbox Release. Yes, your favorite Hadoop VM image just got even better. Cheryle took us through the new features which included Ambari integration and Russell followed up with a quick tour of Ambari. There’s still plenty of time to download Sandbox for a weekend of data crunching fun.

HDP 2.0 Alpha 2 was released. This preview release demonstrates some of the performance improvements in store for the final HDP 2.0 release via YARN, enhancements to Hive per the Stinger Initiative, and Apache Tez. Just before the release, we posted some early test results which showed a 45X (yes, that’s forty five) performance improvement for Hive interactive queries. But that’s just the beginning as we push to 100X, and Microsoft also talked about their contributions to the Stinger Initiative with the same aim in mind.

If you’ve downloaded Sandbox and are looking for some inspiration for a little fun, then Russell also posted a two part series on extracting, loading, querying and analyzing your own Twitter archive with Hive. Part 1 is here, and Part 2 is here.

And finally, there was just the small matter of the Hadoop Summit in Amsterdam. We had a great time and hope you did too. Thank you for attending, contributing to the conversation and supporting Hadoop. If you’re now really excited to learn Hadoop, we posted about the training we have available in Europe and Palo Alto.

And that was the week that was. Has your Sandbox downloaded yet?

Hortonworks Data Platform 2.0 Alpha 2 now available: focus on performance

We are very pleased to announce that the Alpha 2 release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha 2) is now available for download!

A key focus in HDP 2.0 Alpha 2 is on performance as announced in the Stinger initiative, and includes a series of enhancements to the performance of Apache Hive for interactive SQL queries.  In fact HDP 2.0 Alpha 2 was used to perform the tests announced yesterday, showing a 45X performance increase using Hive.  There is much more to come but we are pleased with the early results, and encourage Hive users to take a look and continue to give us feedback.

Consistent with HDP 2.0 Alpha 1, this version is built from the developmental Apache Hadoop 2.0 line and includes Apache YARN, a next-generation resource-management and application framework that enables Hadoop to support an ever-expanding range of use cases.  We are extremely excited about the opportunities that YARN enables – for background, check out Arun Murthy’s blog post series where he provides a YARN overview.

Notable new components over Alpha 1 include:

  • Apache Tez: A new Apache project that provides an optimized data processing framework on top of YARN. Tez is a general-purpose, highly customizable framework that simplifies data processing tasks across both small-scale, low-latency and large-scale, high-throughput workloads in Hadoop. Tez can provide an order of magnitude performance boost for the broader ecosystem of data processing tools such as Apache Hive and Apache Pig.
  • Apache Hive Interactive Query: Beyond the speedups made possible by Apache Tez, several new features were added to speed Hive queries. A new file format called the ORCFile (optimized RC file) optimizes how data is stored and accessed in Hive, and significant query optimizations reduce latency and improve performance.
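
To give a feel for why a column-oriented file layout can help a query that only needs one field, here is a toy comparison. It is a sketch of the general columnar idea only and does not model the real ORCFile format; the function names and the sample field names are made up for the example.

```python
def to_columns(rows):
    """Pivot a list of row dicts (all with the same keys) into per-column lists."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def sum_from_rows(rows, column):
    """Row layout: each whole row is touched to reach a single field."""
    values_touched = sum(len(row) for row in rows)
    return sum(row[column] for row in rows), values_touched


def sum_from_columns(columns, column):
    """Column layout: only the requested column is touched."""
    values = columns[column]
    return sum(values), len(values)
```

Both functions return the same total, but the second value in each result shows how many stored values had to be visited to get it.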

Note that Tez is not enabled by default.  Instructions for doing so, and allowing Hive to use Tez, are in the installation guide.

Learn More
Please take a look at the Hortonworks Documentation to learn more about installing and using HDP 2.0 Alpha 2.

Download It
You can download HDP 2.0 Alpha 2 from the Hortonworks Download site.

Tell Us About It
Please visit the HDP 2.0 Alpha Forum to ask questions, get help, provide feedback and hear what others are doing with HDP. 

We are excited about the opportunities that Hadoop 2 provides for the future of Hadoop and large-scale data processing. HDP 2.0 Alpha 2 is a key milestone that provides organizations with a packaged release to evaluate and gain experience with the upcoming Apache Hadoop 2 technology stack. We look forward to your feedback on HDP 2.0 Alpha 2 while we work with the community to make Hadoop 2 a stable reality. Enjoy!

Note: This Alpha release is a technology preview to gather feedback from outside of Hortonworks. Some features are missing or incomplete. Some APIs may change. Do not use Alpha 2 for production use. Keep away from open flame. Support is only available via Forums.

Week in Review: From Plastics to Windows

We’re wrapping up another busy week at Hortonworks towers. I say another, but actually this is my first week. So… it’s a hello from me, I’m Marc Holmes, Community Director. What have we been talking about this week?

Plastics and Hadoop: discuss! We started the week with a post from our VP of Products, Bob Page, drawing an analogy between the growth of the plastics industry and the disruption to the database market driven by Hadoop. He looks at the connections and differences to SQL and points out ‘what we don’t know yet’ on the evolution of use cases for Hadoop.

Hadoop and Windows sitting in a tree… Arun and Suresh highlighted the joint effort between Hortonworks and Microsoft to make Apache Hadoop run natively on Windows, and celebrated the community vote to move this work into the mainline trunk. We’re community-driven open source folk and we’re delighted not only by the code, but the spirit of community contribution throughout. Microsoft talked about this work over on their Port 25 blog.

Out there. Meantime, there was a LOT of discussion on a couple of articles including this one - Proprietary Hadoop is a Losing Strategy - and this one - One Hadoop Distribution To Rule Them All as a follow up. We believe, and Arun points out, that ‘ultimately the winners in Hadoop will be those investing most heavily in its success’.

But what do you think at a personal level? Do you want Hadoop skills, or Hadoop-a-like skills? Let us know.

And finally, talking of skills, Russell Jurney explained how to Install Hadoop on Windows. So now you know.

Next week… should be quiet. Only the Hadoop Summit in Amsterdam, and a bunch of exciting stuff we’ll tell you more about then. Stay out of trouble and enjoy the show!

Getting Ready for The Elephant Party in Europe

We are just under two weeks away from the start of the first ever Hadoop Summit Europe, and with all of the final preparations being made we thought we would highlight some of the not-to-be-missed activities in and around the event. The event is filling fast but you can still register here.

Here are 10 great reasons to attend!

1)   Great track content – there are 35 informative sessions on Apache Hadoop and related technologies for you to choose from selected by the community and delivered by the experts themselves.

2)   Great keynotes – leading industry analyst Matt Aslett will present the opening keynote and we will also hear from open source veteran Shaun Connolly as well as Hortonworks CTO Eric Baldeschwieler.

3)   Hadoop in the Enterprise expert panel – We will have a live panel discussion from industry leaders including eBay, HSBC and Neustar discussing how and why they use Apache Hadoop.

4)   Meetups – the NLHUG and other communities will be meeting around the event.

5)   Lightning talks – we’ve got rapid fire content coming to you in the form of community selected lightning talks. These 5 minute sessions will give you a taste of a wide range of technologies and initiatives.

6)   It’s Amsterdam – historic, edgy and fun!

7)   Ecosystem – The conference has the support of the broader Hadoop ecosystem so you can come and discuss Hadoop and big data in the solutions showcase.

8)   Community – The Apache Hadoop community is big and getting bigger. Come meet and mingle with other community members to learn about the latest goings on and make new connections.

9)   Get Hadoop certified – Calling all Hadoop Experts! We’re bringing certification to you, if you are ready to take the exam to become a Hortonworks Certified Apache Hadoop Developer (HCAHD) or a Hortonworks Certified Apache Hadoop Administrator (HCAHA).

10)   Get trained on Hadoop – we’ve got a host of classes available during the event to help you learn or sharpen your Hadoop skills. This includes a newly added Applying Data Science class. Check out the classes.

11)  BONUS reason – have a beer on us at the Hadoop Summit Party at the Heineken Experience, a cool venue at a historic location.

Register now and don’t miss the party. Hope to see you there!

Doing More with the Hortonworks Sandbox

The Hortonworks Sandbox was recently introduced, garnering an incredibly positive response and feedback. We are as excited as you, and gratified that our goal of providing the fastest onramp to Apache Hadoop has come to fruition. By providing a free, integrated learning environment along with a personal Hadoop environment, we are helping you gain those big data skills faster. Because of your feedback and demand for new tutorials, we are accelerating the release schedule for upcoming tutorials. We will continue to announce new tutorials via the Hortonworks blog, opt-in email and Twitter (@hortonworks).

When the new tutorials are ready, the update process is simple: it takes one click of a button. Go to the “About Hortonworks Sandbox” icon, and press the Update button. Your initial Sandbox virtual machine installation will remain and only the tutorials will be updated.


 

One of the other requests you had is to have access to more interesting datasets, for you to experiment more with the Sandbox. First, we designed the Sandbox so that you can add your own data into the Sandbox. Since the Sandbox runs on your own system, you control who has access to that data. Second, if you want to play with external data sets, here are a few resources where you can find publicly available data:

In the meantime, if you haven’t yet downloaded or installed the Sandbox, we encourage you to take part in the excitement. Should you need assistance, please go to the Hortonworks Sandbox Forums. Please join us for the Sandbox Webinar on Tuesday, February 5 at 10 am PST. And finally, check back to learn more about the release of new tutorials.

Hortonworks Achieves Quality Assurance and Certification for Rackspace Private Cloud

Today we announced Hortonworks Data Platform certification for Rackspace Private Cloud. In fact, we are the only Apache Hadoop distribution certified with Rackspace Private Cloud. The result of combining the power of enterprise-class Apache Hadoop in Hortonworks Data Platform (HDP) with Rackspace Private Cloud is that organizations now have a secure, scalable environment to refine, explore and enrich their data using Hadoop in the cloud. With HDP, data can be processed from applications that are hosted on Rackspace Private Cloud environments, allowing you to quickly and easily obtain additional business insights from this information. The provisioning, monitoring and management components of HDP are important enablers for the integration with the Rackspace Private Cloud, providing an easy path for getting data into and out of the cloud.

Hadoop Summit Session for Your Consideration: Taking Hadoop to the Clouds

If you’ve been following #hadoopsummit on Twitter, you might have noticed some excitement around the Community Choice, a public voting system that enables the entire Apache Hadoop community to have a say in the sessions chosen for #hadoopsummit EU. Anyone can vote, and the top vote-getters in each track will automatically be included in the #hadoopsummit EU agenda, March 20-21, 2013.

If you’re still deciding which sessions, in which tracks, should be so lucky to get your vote, I have one for your consideration. Our very own Steve Loughran went beyond the twitter-sphere and created a blog to promote why you should vote for his session: Taking Hadoop to the Clouds.

Before we proceed to Steve’s case, remember to vote in the Community Choice process. Help us shape the conference agenda by getting in your vote! Deadline is December 14, so vote today!

This is a guest blog post from Steve, making a strong case for why you should pick his session.

The Hadoop summit vote list is up, and I have two proposals -currently undervoted. Even though I’m on the review committee for the futures strand, not even I could push through a talk that had zero votes on it -ideally I’d like my talks to get in through popular acclaim. I could just create 400 fake email addresses and vote-stuff that way, but I’m lazy.

For that reason, I’m going to talk in detail about why my talks will be so excellent that to even think about having them left out could be detrimental to the entire conference.

One of my talks is “Taking Hadoop to the Clouds”.

There are two competitors:

  1. Deploying Hadoop in the Cloud, which looks at options, details and best practices. I don’t see anything particularly compelling in the abstract -I assume it’s got more votes as it’s the one that comes up first. Or they are trying the many-email-address-vote-stuffing technique(*).
  2. How to Deploy Hadoop Applications on Any Cloud & Optimize Price Performance. This could be interesting, as it covers how CliQr deploys Hadoop on different infrastructures. It sounds like a rackable-style orchestration layer above infrastructures; for Hadoop it may have similarities with MastodonC’s Kixi work.

Why then, should people vote for mine?

I’m giving the talk.

This is not me being egocentrically smug about the quality of my presentations; it is because I’m reasonably confident I know a lot about the area.

  1. My last time at HP Labs was spent on the implementation of the “Cells” virtual infrastructure: declarative configuration of the entire cluster design. The details were presented at the 5th IEEE/ACM conference on Utility and Cloud Computing, and will no doubt be in the ACM library. This means I know about IaaS implementation details; the problems of placement, why networking behaves the way it does, image management, what UIs could look like, what the APIs could be, etc.
  2. I’ve spent a lot of time publicly making Hadoop cloud-friendly. I presume that MS Azure and AWS ElasticMR have put in more hours, but unless they’re going to talk about their work, Tom White and I are the next choices. Jun Ping and VMware colleagues have done a lot too -including big patches into the codebase- but I don’t see any submissions from them.
  3. I have opinions on the matter. They aren’t clear-cut “cloud good/physical bad” or “physical good/cloud bad”. There are arguments either way; it depends on what you want to do, what your data volume is, and where it lives.
  4. I’m still working in the area, in Hadoop itself and the code nearby.

Recent cloud-related activities include:

  • HADOOP-8545: a Swift Filesystem driver for OpenStack. This is something everyone running Hadoop on Rackspace or other OpenStack clusters will want. This week two different implementations have surfaced; getting them merged together is going to be the next activity.
  • WHIRR-667: Add whirr support for HDP-1 installation
  • Ambari with Whirr. Proof of concept more than anything else.
  • Jclouds and Rackspace UK throttling. Adrian Cole managed to reduce the impact of issue-549, which is good as I don’t really want to get sucked into a different OSS codebase.
  • Other things that I’m not going to talk about -yet.

That’s why people should vote for me. The other talks will be about “how we got Hadoop to work in a virtual world” -mine will be about how we improved Hadoop to work in a virtual world.

(*) ps, for anyone planning the many-email-accounts approach, remember that the email addresses are something we reviewers can look at, and many sequential accounts all doing three votes to a single talk will show up as “statistically significant”. Russ has the data, he likes his analyses. He may even have the IP addresses.

[Photo: an interview with Page 6 Guy at ApacheCon]

====

You can also access Steve’s blog here.
