Hortonworks on Apache Hadoop


Enterprise Big Data Analytics with Hortonworks and Datameer

Today, 94% of Hadoop users perform analytics on large volumes of data, something that was not possible before. How do they do it? Cool applications, that’s how.

You have seen various stats indicating that enterprises need better ways of making use of data, but they bear repeating: the volume of business data worldwide, across all companies, doubles every 1.2 years, according to a study published by eBay in May 2012. And market research firm IDC released a forecast showing the big data market may grow from $3.2 billion in 2010 to $16.9 billion in 2015. Clearly, enterprises need better ways of making use of all of this data, which contains innumerable insights for improving business processes and profitability.

Hortonworks partner Datameer has a horizontal application for big data discovery that includes self-service data integration, analytics and visualization on top of Hadoop, including pre-built analytic applications.…

Read More

Hortonworks at Yahoo! Hack Europe

Some news from the UK as Yahoo! Hack Europe welcomed Hortonworks this past weekend in central London.  This two-day event sponsored by Yahoo! was focused on celebrating collaboration, learning and innovation using the world’s leading technologies.  Chris Harris, our local EMEA Solution Engineer, was on hand to add to the discussions.  Partnering with Microsoft, we were able to showcase our HDP on the Azure platform.  This was a fantastic opportunity for the 350 delegates to be exposed to both Azure and enterprise-ready Hadoop provided as the HDInsight Service.

After an appearance by Yahoo’s larger-than-life Hack Robot (seriously, check it out…), who made sure that everyone was entertained, the hack started with a vengeance.  Hyped up on the sweetie cart full of everyone’s favorites, most delegates were now officially up for the challenge.  Inspired by the passion, Chris led a thought-provoking workshop, where a number of the hackers were able to try out real-life scenarios on how Hadoop as part of the HDInsight service can and will be impacting business decisions.  …

Read More

Hadoop Summit Schedule is now available!

Now is the time to get registered for the Hadoop Summit in San Jose, 26-27 June, 2013 – we’d love to see you there. A few weeks ago, we revealed the selectees from the community choice voting, and we’re now delighted to announce that the full schedule of sessions is available here.

Session Schedule

Our thanks to the track selection committees and track chairs for the work on building a great schedule for an awesome event. There are 70 sessions on the schedule so far with more to come later.

This year, the tracks are as follows:

  • Enterprise Data Architecture. This track focuses on Hadoop as a data platform and how it fits within broader enterprise data architectures.
  • Applications and Data Science. Sessions in this track focus on the practice of data science using Hadoop.
  • Deployment and Operations. This track focuses on the deployment, operation and administration of Hadoop clusters at scale.

Read More

Big Data Defined – Part Deux: Value Definition

A few weeks back we posted a definition of “big data”.  There was definitely some internal conversation about the term and whether this definition had captured what the term means.  Summary finding: it is a loaded term.  It means a lot of different things to a lot of different people.

When I first joined Hortonworks, I bought into the three V’s (volume, velocity and variety) definition of big data.  It works for the most part, but is more a descriptor of the data.  It explains the characteristics of the data.  The definition is cold and lacks soul.  After all, “big data” represents the promise of “big” business value.

A “Value” Definition of Big Data

Last year, Shaun Connolly, Hortonworks VP of Corporate Strategy, came up with this definition…
Big Data = Transactions + Interactions + Observations.

I gravitate to this because it outlines WHAT the data is, not just the characteristics. …

Read More

Week in Review: OpenStack, Data Science and Ambari

Almost time to spend a relaxing weekend in the garden, or crushing some data in your garage-based homebrew Hadoop cluster – whichever you prefer. But before we choose our path, let’s take a look at the last two weeks of happenings (I was lost in Oregon last week).

Hadoop is the perfect app for OpenStack. While I was struggling with driving directions, Red Hat, Mirantis and Hortonworks were announcing plans for Project Savanna, which aims to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds. Jim also wrote up some comprehensive notes from the awesome OpenStack Summit event.

Need Data Science? Here’s how to build a team. Ofer followed up his post on 4 Reasons to use Hadoop for Data Science with some thinking on the continuum of skills and roles that represent a data science team. This proved to be something of a hot topic, and was referenced amongst some collective thinking on GigaOM.…

Read More

6 Key Hardware Considerations for Deploying Hadoop in Your Environment

To deploy, configure, manage and scale Hadoop clusters in a way that optimizes performance and resource utilization, there is a lot to consider. Here are 6 key things to think about as part of your planning:

  1. Operating system: Using a 64-bit operating system helps to avoid constraining the amount of memory that can be used on worker nodes. For example, 64-bit Red Hat Enterprise Linux 6.1 or greater is often preferred, due to better ecosystem support and more comprehensive functionality for components such as RAID controllers.
  2. Computation: Computational (or processing) capacity is determined by the aggregate number of Map/Reduce slots available across all nodes in a cluster. Map/Reduce slots are configured on a per-server basis. I/O performance issues can arise from sub-optimal disk-to-core ratios (too many slots and too few disks). HyperThreading improves process scheduling, allowing you to configure more Map/Reduce slots.
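As a rough illustration of the computation point above, the sketch below totals Map/Reduce slots across a cluster and flags nodes with too many slots for their disks. The `WorkerNode` fields, the node figures and the `max_slots_per_disk` threshold are all hypothetical values chosen for the example, not recommendations from the post.

```python
from dataclasses import dataclass


@dataclass
class WorkerNode:
    # All fields are illustrative; real values come from your own hardware
    # and from the per-server slot configuration.
    name: str
    cores: int
    disks: int
    map_slots: int
    reduce_slots: int


def cluster_slots(nodes):
    """Return (total map slots, total reduce slots) across all nodes."""
    total_map = sum(node.map_slots for node in nodes)
    total_reduce = sum(node.reduce_slots for node in nodes)
    return total_map, total_reduce


def oversubscribed(node, max_slots_per_disk=2.0):
    """Flag a node whose slots-per-disk ratio exceeds an assumed threshold.

    The default threshold is an assumption made for this sketch only.
    """
    slots = node.map_slots + node.reduce_slots
    return slots / node.disks > max_slots_per_disk


nodes = [
    WorkerNode("worker-1", cores=8, disks=6, map_slots=8, reduce_slots=4),
    WorkerNode("worker-2", cores=8, disks=2, map_slots=8, reduce_slots=4),
]

total_map, total_reduce = cluster_slots(nodes)
flagged = [node.name for node in nodes if oversubscribed(node)]
```

Here both nodes contribute the same number of slots to the cluster total, but only `worker-2` is flagged, because its twelve slots share two disks.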

Read More

Hadoop and the Data Warehouse: When to Use Which

As a preview to the April 30th webinar: Hadoop & the Enterprise Data Warehouse: When to Use Which, Chad Meley, Global Director of Marketing at Teradata, interviewed the two luminary speakers, Eric Baldeschwieler (aka “eric14”) and Stephen Brobst, about the purpose of their presentation and what you can expect to take away from their shared experiences.

Chad:  “Eric, in this webinar you’re going to talk about the strategic role of relational big data technologies, which have come under fire in some circles with the rise of Hadoop.  As the Founder & CTO of Hortonworks, and former VP of Hadoop Software Engineering at Yahoo!, why do you feel this is an important message?”

Eric Baldeschwieler (eric14):  “We at Hortonworks are very optimistic about the continued growth of Hadoop, and there’s certainly a lot of media coverage, events, and communities that are aiding adoption and contributing to the future of Hadoop. …

Read More

Field Report: OpenStack Summit – The Hadoop Bizarro World

PORTLAND – The Rose City is a great place and this week it got even more interesting with the OpenStack Summit in town. I am more a data geek and very rarely do I venture down the stack into infrastructure, but wow, there is something cool going on with the OpenStack community.  I couldn’t help but get wrapped up in the excitement.  Not only was the enthusiasm palpable, it was also very familiar. I don’t know if it was the organic buzz of Portland or not, but I felt a little like I was in Hadoop bizarro world.

Hadoop on OpenStack

Hortonworks was the only “app” vendor on the show floor and our story was well received.  When you partner with the leading code contributor (Red Hat) and the leading system integrator (Mirantis) and have existing relationships with the founders (Rackspace) of OpenStack, you get some relative street cred.…

Read More

Apache Hadoop and Data Agility

In a recent blog post I mentioned the 4 reasons for using Hadoop for data science. In this blog post I would like to dive deeper into the last of these reasons: data agility.

In most existing data architectures, based on relational database systems, the data schema is of central importance, and needs to be designed and maintained carefully over the lifetime of the project. Furthermore, whatever data fits into the schema will be stored, and everything else typically gets ignored and lost. Changing the schema is a significant undertaking, one that most IT organizations don’t take lightly. In fact, it is not uncommon for a schema change in an operational RDBMS to take 6-12 months if not more.

Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.…
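To make “schema on read” concrete, here is a minimal sketch in plain Python rather than Hadoop itself. The sample events, the `read_with_schema` helper and the two schemas are invented for illustration; the point is only that the raw records are stored untouched and each application projects its own structure onto them when reading.

```python
import json

# Raw events are stored exactly as they arrive; nothing is validated or
# dropped at write time. The records below are made-up sample data.
RAW_EVENTS = [
    '{"user": "alice", "action": "click", "ts": 1366700000}',
    '{"user": "bob", "action": "purchase", "amount": "19.99", "ts": 1366700050}',
    '{"user": "carol", "device": "mobile", "ts": 1366700100}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) while reading.

    Fields absent from a record come back as None. Fields outside the
    schema are simply not projected; they remain in the raw data.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: (convert(record[field]) if field in record else None)
            for field, convert in schema.items()
        }


# Two applications read the same raw data with different schemas.
clickstream_schema = {"user": str, "action": str}
revenue_schema = {"user": str, "amount": float}

clicks = list(read_with_schema(RAW_EVENTS, clickstream_schema))
revenue = list(read_with_schema(RAW_EVENTS, revenue_schema))
```

Adding a third view later, for example one over the `device` field, needs only a new schema dictionary; the stored events do not change.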

Read More

Field Notes: Apache Ambari Meetup at Hortonworks

On April 2nd, Hortonworks was excited to host the very first Apache Ambari Meetup. Thanks to all those who came along in person and virtually for a lot of vibrant discussion. If you would like to get involved in future Ambari Meetups, please visit this link. We are well on the way to making Hadoop management ‘dead simple’.

We have embedded the sessions below with some notes:

Overview and Demo of Ambari, Yusaku Sako, Hortonworks

  • This session covered Apache Ambari’s mission to “Make Hadoop management dead simple” and Ambari’s 4 major roles: 1) Provision, 2) Manage, 3) Monitor, and 4) Integrate. It emphasized that everything Ambari’s Web Client does is done through Ambari’s REST API (100% REST), presented the high-level architecture, and included a live demo on how to provision, manage, and monitor a Hadoop cluster using the latest Ambari 1.2.2 release.
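Because the Web Client works entirely through the REST API, the same calls can be scripted. The sketch below prepares, but does not send, a request that lists clusters. It assumes the `/api/v1/clusters` resource on port 8080 with HTTP Basic authentication; the host name and credentials are placeholders, and the exact paths should be checked against the Ambari release you run.

```python
import base64
import urllib.request


def build_clusters_request(host, user, password, port=8080):
    """Build a GET request for the Ambari cluster list.

    Host, port, path and credentials are assumptions for this sketch.
    """
    url = f"http://{host}:{port}/api/v1/clusters"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder host and credentials; replace with your own.
request = build_clusters_request("ambari.example.com", "admin", "admin")

# To send it against a reachable server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The request is only constructed here so the example runs without a live Ambari server.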

Read More

Hadoop, The Perfect App for OpenStack

The convergence of big data and cloud is a disruptive market force that we at Hortonworks not only want to encourage but also accelerate. Our partnerships with Microsoft and Rackspace have been perfect examples of bringing Hadoop to the cloud in a way that enables choice and delivers meaningful value to enterprise customers. In January, Hortonworks joined the OpenStack Foundation in support of our efforts with Rackspace (i.e. OpenStack-based Hadoop solution for the public and private cloud).

Today, we announced our plans to work with engineers from Red Hat and Mirantis within the OpenStack community on open source Project Savanna to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds.

Why is this news important?

Because big data and cloud computing are two of the top priorities in enterprise IT today, and it’s our intention to work diligently within the Hadoop and OpenStack open source communities to deliver solutions in support of these market needs.…

Read More

How to Build a Hadoop Data Science Team

Data scientists are in high demand these days. Everyone seems to be hiring a team of data scientists, yet many are still not quite sure what data science is all about, and what skill set they need to look for in a data scientist to build a stellar Hadoop data science team. We at Hortonworks believe data science is an evolving discipline that will continue to grow in demand in the coming years, especially with the growth of Hadoop adoption. This role requires experience and knowledge in math, statistics and machine learning, programming and scripting, as well as visualization techniques.

We tend to think of the data scientist role as a continuum of skills:

Software engineers really enjoy crafting new production-grade software systems that are testable, maintainable, secure and scalable. Some of those software engineers specialize in working with data.…

Read More

Week in Review: Patterns, Glue and Moonshots

It’s the end of another action-packed week, so just before we all head off for the weekend, let’s have a recap on the conversations from this week – we hope you’re enjoying them.

We’re delighted by the response to our Hadoop Patterns of Use whitepaper and presentation - that really seems to have struck a chord with everyone thinking about what Hadoop can really do for their business. You can see that content just below here – an excellent read for the journey home.

Also popular were the slides from one of our resident data scientists, Ofer Mendelevitch, who had 4 great reasons to use Hadoop for data science. He’ll be mining for more right now. Another article we liked from Stratconf explained the importance of imagination in data science.

 

Mid-week, we turned our attention to the awesomeness of HCatalog and spent a little time geeking out on the capabilities it provides as the glue for all your data. …

Read More

HP Moonshot: Big Potential for Big Data & Hadoop

While we are quite a long way from hearing “Houston, tranquility base here… the eagle has landed”, the HP Moonshot is definitely pushing us all toward a new class of infrastructure to run more efficient workloads, like Apache Hadoop. Hortonworks applauds the development of flexible Big Data appliances like Moonshot. We are excited about this development as it signals alignment across development, operations and infrastructure within organizations.  For quite some time, our team has been accustomed to a natural balance required across these three constituents and now the server market is joining in on the game.

We agree with our friend Jeff Kelly at Wikibon, who sees “Big Data as one example of a workload that requires a lot of low level optimization. One of the main reasons is that Hadoop clusters are scaled over time in response to increased usage, and factors like power efficiency and the physical footprint of servers become major considerations as the environment grows in size.”…

Read More

4 Reasons to use Hadoop for Data Science

Over the last 10 years or so, large web companies such as Google, Yahoo!, Amazon and Facebook have successfully applied large scale machine learning algorithms over big data sets, creating innovative data products such as online advertising systems and recommendation engines.

Apache Hadoop is quickly becoming a central store for big data in the enterprise, and thus is a natural platform with which enterprise IT can now apply data science to a variety of business problems such as product recommendation, fraud detection, and sentiment analysis.

Building on the patterns of Refine, Explore, Enrich that we described in our Hadoop Patterns of Use whitepaper, let’s review some of the major reasons to use Hadoop for data science, which are also captured in the following presentation:

 

Reason 1: Data exploration with full datasets

Data scientists love their working environment. Whether using R, SAS, Matlab or Python, they always need a laptop with lots of memory to analyze data and build models.…

Read More
