Hortonworks on Apache Hadoop


Teradata Aster & Hortonworks Webinar on Thursday

I wanted to draw your attention to a Webinar taking place this Thursday at 1pm EDT, 10am PDT. “Back to the Future – MapReduce, Hadoop and the Data Scientist” will highlight the benefits of Apache Hadoop and the role that data scientists are playing in big data. The speakers include:

  • Colin White – Founder of BI Research, a leading research, education and consulting firm helping companies understand and benefit from evolving and leading edge technologies in the areas of business intelligence and data management.
  • Tasso Argyros – Co-President of Teradata Aster
  • Ari Zilka – Chief Products Officer for Hortonworks

Among the topics discussed during this free Webinar are:

  • MapReduce for the data scientist: Hadoop/Hive and RDBMS approaches
  • Back to the future: file systems vs. database systems
  • Hadoop and RDBMS coexistence strategies
  • Bridging the gap: new approaches for analyzing data using Hadoop

This promises to be a very interesting and informative presentation, so please register today.

Read More

Introducing Hortonworks Data Platform v1.0

I wanted to take this opportunity to share some important news. Today, Hortonworks announced version 1.0 of the Hortonworks Data Platform, a 100% open source data management platform based on Apache Hadoop. We believe strongly that Apache Hadoop, and therefore Hortonworks Data Platform, will become the foundation for the next-generation enterprise data architecture, helping companies to load, store, process, manage and ultimately benefit from the growing volume and variety of data entering and flowing throughout their organizations. The imminent release of Hortonworks Data Platform v1.0 represents a major step toward achieving this vision.

You can read the full press release here. You can also read what many of our partners have to say about this announcement here. We were extremely pleased that industry leaders such as Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata and VMware all expressed their support and excitement for Hortonworks Data Platform.…

Read More

Announcing General Availability of Hortonworks Data Platform

The following press release was issued by Hortonworks today.

Hortonworks Announces General Availability of Hortonworks Data Platform

Industry’s First Apache Hadoop-based Platform to Include Management, Monitoring and Comprehensive Data Services, Making Hadoop Easy to Consume and Use in Enterprise Environments

Read More

An Advance Look at Hadoop Summit

Hadoop Summit is just around the corner, and by that I mean next week! There is still time to register for the conference, but please do so soon, as the conference is filling up quickly. Today is also the last day that online registration will remain open. After today, you will need to register on-site at the conference itself.

This year’s Hadoop Summit conference, now in its fifth year, promises to be the biggest and best yet. In fact, there are already more people registered for Hadoop Summit 2012 than any other Hadoop conference ever!

I wanted to take this opportunity to share some of the highlights for next week’s conference:

Geoffrey Moore and Other Compelling Keynote Speakers:

Geoffrey Moore, author of “Crossing the Chasm” and “Escape Velocity”, will share his views on “Digitizing the World, the Driving Force Behind Apache Hadoop’s Adoption Life Cycle”.…

Read More

Balancing Community Innovation and Enterprise Stability

Having worked at JBoss and Red Hat from 2004 to 2008 and SpringSource and VMware from 2008 to 2011, I’ve been focused on the world of open source software for a long while. I’ve been blessed to be able to serve enterprise customer needs with high quality open source software such as JBoss Application Server, Hibernate, Drools, Apache Web Server, Apache Tomcat, Spring … and now Apache Hadoop.

As specific open source technologies mature and their use becomes mainstream, it becomes increasingly important to understand and communicate the balancing act that needs to happen between community innovation and enterprise stability.

Community innovation needs a fast pace, where “ship early and often” is a key tenet. Open source projects need to visibly improve and keep innovating if they are to attract a vibrant following. As an open source project’s community grows, its members will expect big improvements and will be fine with early, buggy releases, etc.…

Read More

The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE

Series Introduction

This is part two of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data.  In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

Part one of this series is available here.

Code examples for this post are available here: https://github.com/rjurney/enron-hive.

In the last post, we used Pig to Extract-Transform-Load a MySQL database of the Enron emails to document format and serialize them in Avro. Now that we’ve done this, we’re ready to get to the business of data science: extracting new and interesting properties from our data for consumption by analysts and users.…

Read More

Apache Hadoop 2.0 (Alpha) Released

As the release manager for the Apache Hadoop 2.0 release, it gives me great pleasure to share that the Apache Hadoop community has just released Apache Hadoop 2.0.0 (alpha)! While only an alpha release (read: not ready to run in production), it is still an important step forward as it represents the very first release that delivers new and important capabilities, including:

Read More

The Data Lifecycle, Part One: Avroizing the Enron Emails

Series Introduction

This is part one of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data.  In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

The Berkeley Enron Emails

In this project we will convert a MySQL database of Enron emails into Avro document format for analysis on Hadoop with Pig. Complete code for this example is available here on GitHub.
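To make the idea of "document format" concrete, here is a minimal sketch of the underlying transformation: rows spread across normalized tables are joined into one nested record per email. It is an illustration only, not the code from the post. It uses Python with an in-memory SQLite database standing in for MySQL and prints JSON rather than writing Avro, and the table names, column names and sample rows (`messages`, `recipients`, `rtype`, the example.com addresses) are all assumptions made up for this example.

```python
import json
import sqlite3


def make_sample_db():
    """Build a tiny in-memory database (hypothetical schema, stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE messages (id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, body TEXT);
        CREATE TABLE recipients (message_id INTEGER, rtype TEXT, address TEXT);
        INSERT INTO messages VALUES (1, 'alice@example.com', 'Q3 forecast', 'Numbers attached.');
        INSERT INTO messages VALUES (2, 'bob@example.com', 'Lunch?', 'Noon works.');
        INSERT INTO recipients VALUES (1, 'to', 'bob@example.com');
        INSERT INTO recipients VALUES (1, 'cc', 'carol@example.com');
        INSERT INTO recipients VALUES (2, 'to', 'alice@example.com');
    """)
    return conn


def build_email_documents(conn):
    """Join message and recipient rows into one nested document per email."""
    query = """
        SELECT m.id, m.sender, m.subject, m.body, r.rtype, r.address
        FROM messages m
        LEFT JOIN recipients r ON r.message_id = m.id
        ORDER BY m.id, r.rowid
    """
    docs = {}
    for mid, sender, subject, body, rtype, address in conn.execute(query):
        # Create the document the first time this message id is seen.
        doc = docs.setdefault(mid, {
            "message_id": mid,
            "from": sender,
            "subject": subject,
            "body": body,
            "tos": [],
            "ccs": [],
        })
        # Fold each recipient row into the matching list on the document.
        if rtype == "to":
            doc["tos"].append(address)
        elif rtype == "cc":
            doc["ccs"].append(address)
    return list(docs.values())


if __name__ == "__main__":
    for doc in build_email_documents(make_sample_db()):
        print(json.dumps(doc))
```

Each printed line is one self-contained email record, which is the shape a document-oriented serialization such as Avro would then store with an explicit schema.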

Email is a rich source of information for analysis by many means.…

Read More

Executive Video Series: Introduction to HDFS

The latest video in the Hortonworks Executive Video Series features Sanjay Radia, Hortonworks co-founder and Apache Hadoop PMC member. Sanjay is well known in HDFS circles, having contributed to Hadoop for the past 4+ years. In this video, Sanjay gives a good overview of HDFS, the primary storage system for Hadoop, and provides some insight into both the 0.23 release and what can be expected from HDFS over the rest of 2012. He hits on some key elements such as federation, snapshots and improving the overall storage efficiency of HDFS.

If you would like to learn more about HDFS and where it is heading, make sure to attend Hadoop Summit next month in San Jose. At the conference, Sanjay will be presenting HDFS – What is New and Future together with Suresh Srinivas of Hortonworks, as well as Apache Hadoop and Virtual Machines together with Richard McDougall of VMware.…

Read More

Apache HCatalog 0.4.0 Released

In case you didn’t see the news, I wanted to share the announcement that HCatalog 0.4.0 is now available.

For those of you who are new to the project, HCatalog provides a metadata and table management system that simplifies data sharing between Apache Hadoop and other enterprise data systems. You can learn more about the project on the Apache project site.

The highlights of the 0.4.0 release include:

- Full support for reading from and writing to Hive.
- Support for deeply nested maps, arrays, and structs.
- Switch from StorageDrivers to SerDes. HCatalog no longer supports its own StorageDriver classes for data (de)serialization. Instead it uses Hive’s SerDe classes.
- Addition of JSonSerDe to support reading and writing JSON data.
- The HCatalog binary distribution no longer includes Apache Hive. We now require that Hive first be installed.
- The HCatalog source distribution no longer includes Apache Hive source.…

Read More

Big Data Refinery Fuels Next-Generation Data Architecture

Since joining Hortonworks at the beginning of the year, a question I’ve heard over and over again is “What is Apache Hadoop and what is it used for?”

There’s clearly a lot of hype [and confusion] in this emerging Big Data market, and it feels as if each new technology, as well as each existing one, is pushing the meme of “all your data are belong to us”. It is great to see the wave of innovation occurring across the landscape of SQL, NoSQL, NewSQL, EDW, MPP DBMS, Data Marts, and Apache Hadoop (to name just a few), but enterprises and the market in general can use a healthy dose of clarity on just how to use and interconnect these various technologies in ways that benefit the business.

In my post entitled 7 Key Drivers for the Big Data Market, I asserted that the Big Data movement is not only about the classic world of transactions, but it factors in the new(er) worlds of interactions and observations.…

Read More

7 Key Drivers for the Big Data Market

I attended the Goldman Sachs Cloud Conference and participated on a panel focused on “Data: The New Competitive Advantage”. The panel covered a wide range of questions, but kicked off with two basic ones:

“What is Big Data?” and “What are the drivers behind the Big Data market?”

While most definitions of Big Data focus on the new forms of unstructured data flowing through businesses with new levels of “volume, velocity, variety, and complexity”, I tend to answer the question using a simple equation:

Big Data = Transactions + Interactions + Observations

The following graphic illustrates what I mean:

Read More

Executive Video Series: Introduction to HCatalog

We just added a video to the Hortonworks Executive Video library that features Alan Gates, Hortonworks co-founder and Apache PMC member. In this video, Alan discusses HCatalog, one of the most compelling projects in the Apache Hadoop ecosystem.

HCatalog is a metadata and table management system that provides a consistent data model and schema for users of tools such as MapReduce, Hive and Pig. When you consider that there are often users accessing Hadoop clusters using different tools that independently don’t agree on schema, data types, how and where data is stored, etc., then you can understand the value of having a tool such as HCatalog.

In this video, Alan does a good job of not only explaining the role of HCatalog, but also laying out the future direction of the project. He talks about improving the integration with HBase, improving information lifecycle management and expanding the HCatalog data model to address the challenges of unstructured data.…

Read More

Record Support for Hadoop Summit

In case you didn’t see the news today, Hadoop Summit announced record ecosystem support for this year’s conference. The original and world’s largest Apache Hadoop conference, now in its fifth year, is being sponsored this year by more than 40 traditional and open source software and services companies.

Hortonworks and our co-host Yahoo! would like to thank the following companies for helping to make Hadoop Summit possible:

Read More

Executive Video Series: Apache Hadoop and Next Generation MapReduce

The third installment of the Hortonworks executive video series features Arun C. Murthy, co-founder of Hortonworks and VP of Apache Hadoop for the Apache Software Foundation. In this video, Arun shares his view of the power of Apache Hadoop and provides some insight into the future direction of MapReduce, including the ability to support alternate programming paradigms.

Read More
