Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.
Hadoop traces its origins to the Google File System paper published in October 2003, which was followed by a second Google research paper, MapReduce: Simplified Data Processing on Large Clusters. Development started in the Apache Nutch project but moved to the new Hadoop subproject in January 2006. The first committer added to the Hadoop project was Owen O’Malley in March 2006. Hadoop 0.1.0 was released in April 2006, and the platform continues to evolve through the work of the many contributors to the Apache Hadoop project. Hadoop was named after a toy elephant that belonged to the son of one of its founders.
In 2011, Rob Bearden partnered with Yahoo! to establish Hortonworks with 24 engineers from the original Hadoop team including founders Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O’Malley, Sanjay Radia, and Suresh Srinivas.
Organizations use Hadoop for its ability to store, manage, and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly, and at low cost.
The Hadoop Distributed File System (HDFS) provides scalable, fault-tolerant, cost-efficient storage for your big data lake. It was designed to span large clusters of commodity servers, scaling up to hundreds of petabytes and thousands of servers. By distributing storage across many servers, the combined storage resource can grow linearly with demand while remaining economical at any scale.
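To make the idea of distributing storage concrete, here is a rough sketch of how one file maps onto blocks and replicas. It assumes the usual HDFS defaults in recent Hadoop releases (128 MB blocks, replication factor 3); neither value comes from the text above, both are configurable per cluster, and the function and field names are hypothetical.

```python
import math
from typing import NamedTuple

MIB = 1024 * 1024

# Assumed defaults, not taken from the article: 128 MiB blocks, 3 replicas.
DEFAULT_BLOCK_SIZE = 128 * MIB
DEFAULT_REPLICATION = 3


class StoragePlan(NamedTuple):
    blocks: int          # logical blocks the file is split into
    replicas: int        # block copies spread across the cluster
    raw_bytes: int       # total disk space consumed across all servers


def plan_storage(file_size: int,
                 block_size: int = DEFAULT_BLOCK_SIZE,
                 replication: int = DEFAULT_REPLICATION) -> StoragePlan:
    """Estimate how a file of file_size bytes is laid out (hypothetical helper)."""
    if file_size < 0 or block_size <= 0 or replication <= 0:
        raise ValueError("sizes and replication must be positive")
    blocks = math.ceil(file_size / block_size)
    # A partially filled final block only occupies its actual length,
    # so raw usage is simply the file size times the replication factor.
    return StoragePlan(blocks, blocks * replication, file_size * replication)


if __name__ == "__main__":
    print(plan_storage(1024 * MIB))
```

Because each block is stored on several different servers, losing one server does not lose data, and adding servers adds capacity in roughly equal steps.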
MapReduce is the original framework for writing massively parallel applications that process large amounts of structured and unstructured data stored in HDFS. MapReduce can take advantage of the locality of data, processing it near the place it is stored on each node in the cluster in order to reduce the distance over which it must be transmitted.
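As an illustration of the programming model only, the following single-process sketch counts words using the map, shuffle, and reduce steps. It does not use the Hadoop API; the function names are hypothetical, and on a real cluster the map tasks would run in parallel on the nodes that hold each piece of the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in order over an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

The map and reduce functions never share state, which is what lets a framework run many copies of them on different nodes at once.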
More recently, Apache Hadoop YARN opened Hadoop to other data processing engines, such as Apache Spark, that can now run alongside existing MapReduce jobs to process data in many different ways at the same time. YARN provides the centralized resource management that enables you to process multiple workloads simultaneously. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce’s ability to scale to petabytes of data.
Applications can interact with the data in Hadoop using batch or interactive SQL (Apache Hive) or low-latency access with NoSQL (Apache HBase). Hive allows business users and data analysts to use their preferred business analytics, reporting and visualization tools with Hadoop. Data stored in HDFS in Hadoop can be searched using Apache Solr.
The Hadoop ecosystem extends data access and processing with powerful tools for data governance and integration including centralized security administration (Apache Ranger) and data classification tagging (Apache Atlas), which combined enable dynamic data access policies that proactively prevent data access violations from occurring. Hadoop perimeter security is also available to integrate with existing enterprise security systems and control user access to Hadoop (Apache Knox).
A very common customer request is to index text in image files, for example, text in scanned PNG files. In this tutorial we are going to walk through how to do this with Solr. Prerequisites: Download the Hortonworks Sandbox. Complete the Learning the Ropes of the HDP Sandbox tutorial. Step-by-step guide […]
Overview: The Azure cloud infrastructure has become a common place for users to deploy virtual machines in the cloud due to its flexibility, ease of deployment, and cost benefits. In addition, Microsoft has expanded Azure to include a marketplace with thousands of certified, open source, and community software applications, developer services, and data, all pre-configured for Microsoft Azure. […]
Introduction: Hortonworks has recently announced the integration of Apache Atlas and Apache Ranger, and introduced the concept of tag-based or classification-based policies. Enterprises can classify data in Apache Atlas and use the classification to build security policies in Apache Ranger. This tutorial walks through an example of tagging data in Atlas and building a […]
Introduction: Hortonworks introduced Apache Atlas as part of the Data Governance Initiative, and has continued to deliver on the vision for an open source solution for centralized metadata store, data classification, data lifecycle management and centralized security. Atlas is now offering, as a tech preview, cross-component lineage functionality, delivering a complete view of data movement […]
Apache Zeppelin on HDP 2.4.2 Author: Vinay Shukla In March 2016 we delivered the second technical preview of Apache Zeppelin, on HDP 2.4. Meanwhile we and the Zeppelin community have continued to add new features to Zeppelin. These features are now available in the final technical preview of Apache Zeppelin. This technical preview works with […]
Introduction: In this tutorial, we will give you a taste of the powerful Machine Learning libraries in Apache Spark via a hands-on lab. We will also introduce the necessary steps to get you up and running with Apache Zeppelin on a Hortonworks Data Platform (HDP) Sandbox. Prerequisites: This tutorial is part of a series of […]
In this tutorial we will explore how you can use policies in HDP Advanced Security to protect your enterprise data lake and audit access by users to resources on HDFS, Hive and HBase from a centralized HDP Security Administration Console.
Introduction: Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides a central security policy administration across the core enterprise security requirements of authorization, accounting and data protection. Apache Ranger already extends baseline features for coordinated enforcement across Hadoop workloads from batch, interactive SQL and real-time in Hadoop. In this tutorial, […]
(Image Courtesy – www.theastuteadvisor.com) “Perhaps more than anything else, failure to recognize the precariousness and fickleness of confidence, especially in cases in which large short-term debts need to be rolled over continuously, is the key factor that gives rise to the this-time-is-different syndrome. Highly indebted governments, banks, or corporations can seem to be merrily rolling along for an extended period, when bang! confidence collapses, […]
It has been another exciting week on Hortonworks Community Connection (HCC). We continue to see great activity and recommend the following assets from last week. Top Articles from HCC: Phoenix HBase Tuning – Quick Hits (by: smanjee). HBase tuning, like any other service within the ecosystem, requires understanding of the configurations and the impact (good or […]
Capital Markets are the face of the financial industry to the general public and generate a large percentage of the GDP for the world economy. Despite all the negative press they have garnered since the financial crisis of 2008, capital markets perform an important social function in that they contribute heavily to economic growth and are […]
Following the success of our sold-out 2015 Roadshow, we are pleased to announce our worldwide Future of Data Roadshow 2016! The Roadshow brings the innovators driving the future of data to you and offers insightful content for both business and technical attendees. This is an invaluable opportunity to network with leaders who are transforming their business […]
It has been another exciting week on Hortonworks Community Connection (HCC). We continue to see great activity and recommend the following assets from last week. Top Articles from HCC: Horses for Courses: Apache Spark Streaming and Apache Nifi (by: vvaks). Comparing Apache Nifi and Apache Spark Streaming for different streaming and IoT use cases. Data Analysis […]
Hadoop Summit in San Jose wrapped up a few weeks ago. This was the ninth year and, wow, have we come a long way. It’s been a decade for Apache Hadoop and five years for Hortonworks. Hadoop Summit is the leading conference for Hadoop and data management, and this year saw well over 4,000 attendees […]
It has been another exciting week on Hortonworks Community Connection (HCC). We have lots of great technical content and are continuing to see great activity. We recommend the following assets from last week. Top Articles from HCC: Disaster recovery and Backup best practices in a typical Hadoop Cluster: Series 1 Introduction (by: rbiswas). Disaster recovery plan […]
It has been another exciting week on Hortonworks Community Connection (HCC). We have lots of great technical content and are continuing to see great activity. We recommend the following assets from last week. Top Articles from HCC: Adding KDC Administrator Credentials to the Ambari Credential Store (by: rlevas); Rack Awareness (by: rbiswas); Spark+Pycharm+Pybuilder on Docker (by: smanjee); YARN […]
“The data fabric is the next middleware.” –Todd Papaioannou, CTO at Splunk. Enterprises across the globe are confronting the need to create a Digital Strategy. While the term itself may seem intimidating to some, to the business it essentially implies an agile culture built on customer centricity & responsiveness. The only way to attain Digital success […]
I was back ‘home’ for Hadoop Summit San Jose last week and I have to admit, it was fantastic to be hosting our customers and partners from across Europe, Middle East, Africa and Asia! It was a true testament to the relationships I’ve seen develop first hand within our international business over the past 12 […]
According to Strategy Meets Action (SMA), the value and disruption do not come from the “things” or the technology itself. New, actionable insights can be gleaned from massive amounts of new data being collected and analyzed. Insurers must build strong enterprise-wide data management and analytics capabilities to be in a position to capitalize on these […]
The first decade is over and we’re entering the second. One industry watcher makes a great point: Awkward teenage years ahead? I don’t believe we’ll be one of those ‘difficult’ teenagers. We might be a bit of a nerd, but we’ll be the well balanced one. The one with friends, the one that goes to […]
This week we made a huge step forward in accelerating genomics-based precision medicine in research and clinical care, starting a consortium of experts and organizations who will help to define the next generation of genomics research. We’ve already been joined by Arizona State University, Baylor College of Medicine, Booz Allen Hamilton, Mayo Clinic, OneOme and […]
Big data is changing the way enterprises interact with and consume data. Modern data platforms, such as Hortonworks Data Platform (HDP) and Hortonworks Data Flow (HDF), are driving a data revolution by powering new workloads and analytic applications. This week, there are thousands of attendees in San Jose at Hadoop Summit 2016 learning about the […]
Earlier today, we announced that Open Energi has tripled its investment in data analytics by adding Hortonworks DataFlow to its existing use of Hortonworks Data Platform to manage its data in motion. Open Energi is at the forefront of smart grids and the Internet of Things in the UK. Put simply, it allows its customers’ […]
Today we announced Microsoft Azure HDInsight as our Premier Connected Data Platforms cloud solution, providing customers Apache™ Hadoop® as a fully managed cloud service. The announcement is very timely as this week, Hortonworks and Microsoft are celebrating the 10th anniversary of Hadoop at the Hadoop Summit 2016. Highlighting HDInsight as our premier cloud offering helps […]
This is part two of a two-part series from Hadoop Summit. In his post, Rob Bearden talks about how data transforms everything and the need for Connected Data Platforms. As a follow-on, here are four predictions for technologies behind this transformation. #1: Intelligent Self-Configuring Networks Will Enable New & Faster Delivery of Data and Analytics Across Data […]
This morning at Hadoop Summit, I gave a talk about how data will transform everything and the need for Connected Data Platforms. Let me explain what I mean. Data is Transforming the Enterprise Until now, enterprise data was largely structured and not particularly diverse. This reality spawned what we know now as traditional IT best […]
Early this year, we announced our partnership with Pivotal and Syncsort, incorporating key technologies from the ecosystem to optimize the value from Hortonworks Connected Data Platforms. Today, I am very excited to announce an addition to our partnerships, under which we will provide global access to and resell AtScale. Customers are constantly asking us to find simpler, faster […]
Hadoop Summit San Jose is here once again and with it comes a reminder of the power of the Open Source Community and the tremendous innovation which continues to occur within the Apache Hadoop ecosystem. At Hortonworks, we get the opportunity to engage with this vibrant, creative, and talented group of engineers all year round, […]
“Water, water everywhere, / Nor any drop to drink.” These lines from “The Rime of the Ancient Mariner” by Samuel Taylor Coleridge also accurately describe companies that are trying to transform themselves into data-driven organizations. These organizations have astronomical volumes of raw data at their disposal, but how do they find that proverbial […]
We are right on the verge of some great celebrations of 10 years of Apache Hadoop! Hadoop Summit San Jose 2016 is almost here too, marking these celebrations! Held on June 28-30, 2016, it is the event for technical and business audiences to learn how big data continues to be a major force in transforming the […]
“The world is one big data problem.” –Andrew McAfee, associate director of the Center for Digital Business at MIT Sloan. One whole year of almost daily client meetings & discussions with industry leaders has helped me crystallize my view of an important yet abstract idea into reality. That is, Big Data capabilities or the lack of […]
Today we announced our Global Professional Services (GPS) program and new offerings to help enable Hortonworks Connected Data Platforms’ customers with implementation, advisory, and managed services. The GPS program and new Hortonworks Services offerings are tailored to meet the expertise needs at any stage of the data platform adoption journey and will help enterprises to […]
At Hadoop Summit San Jose we are excited to be joined by experts from across the industry. Here are just a few of the business-focused sessions, but you need to register to attend Hadoop Summit. What is Data? And What Are You Doing? Speaker: Russell Foltz-Smith from RFS Productions. Abstract: We all talk about and do […]