Hortonworks on Apache Hadoop


UC Irvine Health: Improving Quality of Care with Apache Hadoop (Part 2)

This is the second part of a series written by Charles Boicey from UC Irvine Health (part 1 is here). The series will demonstrate a real case study for Apache Hadoop in healthcare and also journal the architecture and technical considerations presented during implementation.

It has been 232 days since the last post. Much has transpired including a rebranding of the organization from UCI Medical Center to UC Irvine Health. I am happy to report we have a production Saritor environment up and running on the Hortonworks Data Platform.

Here are some highlights from the past 232 days:

Home Monitoring

In collaboration with our medical device integration partner, iSirona, we are developing a system to acquire home monitoring data and transmit it to Saritor. Our first deployed device will be a scale. This may sound simple, but in-home monitoring of the daily weights of Congestive Heart Failure patients is essential to preventing those patients from being readmitted to the hospital.…


Hive/HCatalog – Data Geeks & Big Data Glue

Unstructured data, semi-structured data, structured data… it is all very interesting and we are in conversations about big and small versions of each of these data types every day. We love it…  we are data geeks at Hortonworks. We passionately understand that if you want to use any piece of data for some computation, there needs to be some layer of metadata and structure to interact with it.  Within Hadoop, this critical metadata service is provided by HCatalog.

As a key component of Apache Hive, HCatalog is a metadata and table management system for the broader Hadoop platform. It enables the storage of data in any format regardless of structure. Hadoop can then process both structured and unstructured data and it can store and share information about data’s structure in HCatalog. This capability combined with the ‘schema on read’ nature of Hadoop versus traditional EDW ‘schema on write’ reduces cycle time for data scientists seeking insight as it encourages exploration and discovery on a continuous basis.…
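
The ‘schema on read’ idea above can be illustrated with a minimal sketch. This is plain Python, not the HCatalog or Hive API; the sample records, the schema lists and the `read_with_schema` helper are all hypothetical names invented for illustration. The raw text is stored untouched, and structure is applied only when the data is read, so the same bytes can be read under more than one schema.

```python
import csv
import io

# Raw data is kept exactly as it arrived; nothing is validated at write time.
RAW_LINES = "1,click,0.5\n2,view,n/a\n"

# Hypothetical table definitions, standing in for metadata a catalog would hold.
SCHEMA = [("event_id", int), ("action", str), ("duration_s", float)]
ALT_SCHEMA = [("event_id", str), ("action", str)]


def read_with_schema(raw_text, schema):
    """Apply a schema to raw comma-delimited text at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        # zip stops at the shorter sequence, so a narrower schema
        # simply ignores trailing columns.
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                # The value does not fit this schema: keep the row, null the field.
                row[name] = None
        rows.append(row)
    return rows


typed_rows = read_with_schema(RAW_LINES, SCHEMA)
loose_rows = read_with_schema(RAW_LINES, ALT_SCHEMA)
```

Because the schema lives outside the data, a second reader can apply `ALT_SCHEMA` to the same raw text without rewriting or reloading anything, which is the property the excerpt credits with shortening exploration cycles.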


Hortonworks Sandbox: Dreaming Up New Tutorials For You

We’re cooking up some new tutorials for you to play with in your Hortonworks Sandbox to help you learn more about the Hortonworks Data Platform, Apache Hadoop, Hive, Pig and HCatalog, with maybe a smattering of Mahout in there as well.


While you’re anxiously awaiting, we thought we’d give you some pointers to some resources so that you can experiment and play. After all, that’s what a Sandbox is all about, right?

Language Manuals

First, if you’re looking to expand your skills, take a look at the Hive Language Manual, the Pig Tutorial on the Apache Foundation website, and the Command Line Interface information on the HCatalog project incubator site.

Use Hive to SQLize

Feeling a bit more advanced? Take a look at Russell Jurney’s blog posts, HOWTO use Hive to SQLize your own Tweets Part 1, and HOWTO use Hive to SQLize your own Tweets Part 2.…


Apache Hadoop Patterns of Use: Refine, Enrich and Explore

“OK, Hadoop is pretty cool, but exactly where does it fit and how are other people using it?”  Here at Hortonworks, this has got to be the most common question we get from the community… well that and “what is the airspeed velocity of an unladen swallow?”

We think about this (where Hadoop fits) a lot and have gathered a fair amount of expertise on the topic.  The core team at Hortonworks includes the original architects, developers and operators of Apache Hadoop and its use at Yahoo, and through this experience and working within the larger community they have been privileged to see Hadoop emerge as the technological underpinning for so many big data projects. That has allowed us to observe certain patterns that we’ve found greatly simplify the concepts associated with Hadoop, and our aim is to share some of those patterns here.


Where are Hortonworkers? Events and Meetups 8th April to 22nd April

Hortonworkers are out there – here is a rundown of events and meetups we’ll be at in the next couple of weeks, and we hope we’ll see you there. Did we miss any? Want us to attend your event? Let us know!

Big Data Innovation Summit

April 10-11, 2013, San Francisco, CA

http://theinnovationenterprise.com/summits/big-data-innovation-summit-april-2013-san-francisco

Spring into April and jump into Big Data! Be sure to meet us at the Big Data Innovation Summit by the bay. We’re excited to have Alan Gates, co-founder of Hortonworks, presenting a couple of really exciting talks, and we hope you can join us.

  •  April 11 @ 9:30am: Coordinating the Many Tools of Big Data in Hadoop
  •  April 11 @ 12:30pm: Hadoop Now, Next and Beyond
  •  April 11 @ 2:00pm: Roundtable Session: Use Case Patterns: Horizontal or Vertical

As a global sponsor, we’ll also be exhibiting. Look for us in the exhibit area and meet members of the Hortonworks team, who will be happy to discuss any questions you have on Hadoop and Hortonworks.…


Week in Review: Falcon, Hadoop Momentum and BFFs Forever!

More of a 2 weeks in review this time around owing to the Easter break. So what’s been happening?

Falcon bringing Data Lifecycle Management for Hadoop. The big news this week was the newly approved Apache Software Foundation incubator project – Falcon. The project was initiated by the team at InMobi and engineers from Hortonworks towers with the intent of simplifying data management through a data lifecycle management framework. Something for everyone then. More on Falcon here. Once again, it’s a great example of community driven open source driving the innovation that matters, or as Mohit Saxena of InMobi said:

Want to be BFFs with Hortonworks? According to this article on TechWorld, everyone does, and Neustar details why. We’re flattered by the sentiment and we’d love to be your friend. You can ‘Like’ us over here.

Market Momentum. So, with all of the innovation and buzz around Hadoop and Hortonworks, what does that mean for you, me, or anyone looking to dip a toe in the water?…


Integrating Apache Hadoop and SAP

With any enterprise software implementation, the challenge is often the integration of a chosen system with the existing enterprise systems architecture. One such existing investment may be ERP (and related) systems such as those provided by SAP. In this real-world instance, SAP partnered with Hortonworks to enable integration of Apache Hadoop into SAP Real-Time Data Platforms using Hortonworks Data Platform to facilitate business intelligence and analysis of Big Data.

The business challenges at hand will be familiar to everyone and are a great fit for a Hadoop solution. These are:

  • Data does not fit neatly in a relational format. The customer gathers more than one hundred million surveys each year. The most valuable data is in the “comments” field which is unstructured and therefore not analyzed.
  • The business cannot view data across departments. Customer training data, for example, is not typically joined across departments with the call center’s CRM application to help tailor a support call to the customer’s expertise.


Big Data Defined

‘Big Data’ has become a hot buzzword, but a poorly defined one. Here we will define it.

Wikipedia defines Big Data in terms of the problems posed by the awkwardness of legacy tools in supporting massive datasets:

In information technology, big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

It is better to define ‘Big Data’ in terms of opportunity, in terms of transformative economics. Big Data is the opportunity space created by new open source, distributed systems from the consumer internet space.

Specifically, a Big Data system has four properties:

  • It uses local storage to be fast but inexpensive
  • It uses clusters of commodity hardware to be inexpensive
  • It uses free software to be inexpensive
  • It is open source to avoid expensive vendor lock-in

Cheap storage means logging enormous volumes of data to many disks is easy.…


Project Falcon: Tackling Hadoop Data Lifecycle Management via Community Driven Open Source

Today we are excited to see another example of the power of community at work as we highlight the newly approved Apache Software Foundation incubator project named Falcon. This incubation project was initiated by the team at InMobi together with engineers from Hortonworks. Falcon is useful to anyone building apps on Hadoop as it simplifies data management through the introduction of a data lifecycle management framework.

All About Falcon and Data Lifecycle Management

Falcon is a data lifecycle management framework for Apache Hadoop that enables users to configure, manage and orchestrate data motion, disaster recovery, and data retention workflows in support of business continuity and data governance use cases.

 

Falcon’s goal is to simplify data management on Hadoop, and it achieves this by providing important data lifecycle management services that any Hadoop application can rely on. Instead of hard-coding complex data lifecycle capabilities, apps can now rely on a proven, well-tested and extremely scalable data management system built specifically for the unique capabilities that Hadoop offers.…


Keynotes from Hadoop Summit Amsterdam 2013

The slides and videos from Hadoop Summit in Amsterdam have begun to flow so you can enjoy the sessions.

Whilst you’re thinking about which sessions to watch and read, we suggest taking a look at the keynotes for the event:

  • What is the point of Hadoop? (Video, Slides): Matt Aslett, Research Director, Data Management and Analytics, 451 Research
  • Real-World insight into Hadoop in the Enterprise (Video): Panel featuring HSBC, eBay, Neustar and More

We hope you enjoy these sessions, and the content from the tracks. Let us know in the comments! And don’t forget that there is plenty of time to register for Hadoop Summit San Jose 2013.


Hadoop Market Momentum and You

On 27th March, the Wall Street Journal published an article ‘VCs Bet Big Bucks on Hadoop’ and it seems clear that the market is going to be huge. But what does that mean to you and your personal skills investment? Here’s our view:

Hadoop is HOT

Hadoop is incredibly hot right now as the number of available jobs continues to grow enormously (hey – we even have a bunch of our own right here).

Indeed’s Job Trends shows Hadoop as the 7th hottest skill, and it’s in great company alongside app development skills such as iOS, Android and jQuery. I guess that’s to be expected of course: insights from big data are the fuel for the smartest apps of the future.

The Hadoop trend itself is fairly clear. In growth terms, that is pretty explosive!

 

A quick search on LinkedIn will pull back around 1200 Hadoop jobs right now (it was 1281 when I checked).…


Hadoop Summit North America 2013: Community Choice Results

And the voting is over and the results are in for the Community Choice program of the Hadoop Summit San Jose 2013.

With over 300 sessions, and around 6000 users casting more than 15000 votes there was a lot of excitement to participate and influence the results - thanks to everyone for your contribution. At the end of the process, the selectees are:

  • Application and Data Science Track: Watching Pigs Fly with the Netflix Hadoop Toolkit (Netflix)
  • Deployment and Operations Track: Continuous Integration for the Applications on top of Hadoop (Yahoo!)
  • Enterprise Data Architecture Track: Next Generation Analytics: A Reference Architecture (Mu Sigma)
  • Future of Apache Hadoop Track: Jubatus: Real-time and Highly-scalable Machine Learning Platform (Preferred Infrastructure, Inc.)
  • Hadoop (Disruptive) Economics Track: Move to Hadoop, Go Fast and Save Millions: Mainframe Legacy Modernization (Sears Holding Corp.)
  • Hadoop-driven Business / BI Track: Big Data, Easy BI (Yahoo!)
  • Reference Architecture Track: Genie – Hadoop Platformed as a Service at Netflix (Netflix)

Congratulations to the selectees for each track, and a further honorable mention to Sears for winning ‘Longest Session Title So Far’, which was a surprisingly hard fought contest!…


Understanding Hadoop 2.0

In this post, we’ll explain the difference between Hadoop 1.0 and 2.0. After all, what is Hadoop 2.0? What is YARN?

For starters – what is Hadoop and what is 1.0? The Apache Hadoop project is the core of an entire ecosystem of projects. It consists of four modules (see here):

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
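
To make the MapReduce module in the list above more concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word count as the example. It is plain Python and does not use the Hadoop API; the function names are hypothetical, and a real cluster would distribute each phase across many machines rather than run them in one loop.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, mimicking the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop YARN", "hadoop HDFS"])
```

The point of the split is that `map_phase` and `reduce_phase` each look at one line or one key at a time, which is what lets a framework run many copies of them in parallel over a large data set.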

Hadoop 1.0 is based on the Hadoop 0.20.205 branch (it went 0.18 -> 0.19 -> 0.20 -> 0.20.2 -> 0.20.205 -> 1.0). Hard to follow? Check out this chart. Not hard for an open source developer, but obscure for an enterprise product – so everyone agreed to call 0.20.205 ‘1.0’, the project having matured to that point.…


Week in Review: Sandboxes, HDP 2.0 Alpha 2, Hive Performance and Summits

It’s almost time for that final drive home of the week, and what a week it has been with a few new releases, a summit, and a little bit of technical fun. Here’s what happened:

New Sandbox Release. Yes, your favorite Hadoop VM image just got even better. Cheryle took us through the new features which included Ambari integration and Russell followed up with a quick tour of Ambari. There’s still plenty of time to download Sandbox for a weekend of data crunching fun.

HDP 2.0 Alpha 2 was released. This preview release demonstrates some of the performance improvements in store for the final HDP 2.0 release via YARN, enhancements to Hive per the Stinger Initiative, and Apache Tez. Just before the release, we posted some early test results which showed a 45X (yes, that’s forty five) performance improvement for Hive interactive queries.…

