Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.
The genesis of Hadoop came from the Google File System paper that was published in October 2003. This paper spawned another research paper from Google – MapReduce: Simplified Data Processing on Large Clusters. Development started in the Apache Nutch project, but was moved to the new Hadoop subproject in January 2006. The first committer added to the Hadoop project was Owen O’Malley in March 2006. Hadoop 0.1.0 was released in April 2006 and continues to be evolved by the many contributors to the Apache Hadoop project. Hadoop was named after one of the founder’s toy elephant.
In 2011, Rob Bearden partnered with Yahoo! to establish Hortonworks with 24 engineers from the original Hadoop team including founders Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O’Malley, Sanjay Radia, and Suresh Srinivas.
Some of the reasons organizations use Hadoop is its’ ability to store, manage and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly and at low-cost.
The Hadoop Distributed File System (HDFS) provides scalable, fault-tolerant, cost-efficient storage for your big data lake. It was designed to span large clusters of commodity servers scaling up to hundreds of petabytes and thousands of servers. By distributing storage across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.
MapReduce is the original framework for writing massively parallel applications that process large amounts of structured and unstructured data stored in HDFS. MapReduce can take advantage of the locality of data, processing it near the place it is stored on each node in the cluster in order to reduce the distance over which it must be transmitted.
More recently, Apache Hadoop YARN opened Hadoop to other data processing engines that can now run alongside existing MapReduce jobs to process data in many different ways at the same time, such as Apache Spark. YARN provides the centralized resource management that enables you to process multiple workloads simultaneously. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data.
Applications can interact with the data in Hadoop using batch or interactive SQL (Apache Hive) or low-latency access with NoSQL (Apache HBase). Hive allows business users and data analysts to use their preferred business analytics, reporting and visualization tools with Hadoop. Data stored in HDFS in Hadoop can be searched using Apache Solr.
The Hadoop ecosystem extends data access and processing with powerful tools for data governance and integration including centralized security administration (Apache Ranger) and data classification tagging (Apache Atlas), which combined enable dynamic data access policies that proactively prevent data access violations from occurring. Hadoop perimeter security is also available to integrate with existing enterprise security systems and control user access to Hadoop (Apache Knox).
Overview The Azure cloud infrastructure has become a commonplace for users to deploy virtual machines on the cloud due to its’ flexibility, ease of deployment, and cost benefits. In addition, Microsoft has expanded Azure to include a marketplace with thousands of certified, open source, and community software applications, developer services, and data—pre-configured for Microsoft Azure. […]
A very common request from many customers is to be able to index text in image files; for example, text in scanned PNG files. In this tutorial we are going to walkthrough how to do this with SOLR. Prerequisites Download the Hortonworks Sandbox Complete the Learning the Ropes of the HDP Sandbox tutorial. Step-by-step guide […]
In this tutorial we will explore how you can use policies in HDP Advanced Security to protect your enterprise data lake and audit access by users to resources on HDFS, Hive and HBase from a centralized HDP Security Administration Console.
Introduction Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides a central security policy administration across the core enterprise security requirements of authorization, accounting and data protection. Apache Ranger already extends baseline features for coordinated enforcement across Hadoop workloads from batch, interactive SQL and real–time in Hadoop. In this tutorial, […]
Overview Apache Ambari is a completely open operational framework for provisioning, managing and monitoring Apache Hadoop clusters. Ambari includes an intuitive collection of operator tools and a set of APIs that mask the complexity of Hadoop, simplifying the operation of clusters. In this tutorial, we will walk through the some of the key aspects of […]
Introduction Hortonworks has recently announced the integration of Apache Atlas and Apache Ranger, and introduced the concept of tag or classification based policies. Enterprises can classify data in Apache Atlas and use the classification to build security policies in Apache Ranger. This tutorial walks through an example of tagging data in Atlas and building a […]
This tutorial will help you get started with Hadoop and HDP. We will use an Internet of Things (IoT) use case to build your first HDP application.
Introduction Apache Falcon simplifies the configuration of data motion with: replication; lifecycle management; lineage and traceability. This provides data governance consistency across Hadoop components. Scenario In this tutorial we will walk through a scenario where email data gets processed on multiple HDP 2.2 clusters around the country then gets backed up hourly on a cloud […]
Hortonworks DataFlow has been seeing great success being deployed in multiple use cases. We recently shared a set of real-world use cases on a webinar, and also wanted to share here so readers can peruse which types of uses cases are being implemented and see if there are parallels between current and future users of […]
Wealth Management is the highest growth businesses for any medium to large financial institution. It also is the highest customer touch segment of banking and is fostered on long term (read extremely lucrative advisory) relationships. This three part series explores the automated “Robo-advisor” movement in the first post. We will cover the business background and some definitions . The second post will focus on the […]
It’s no secret that there is a data explosion. A recent IDC analyst report from April 2014 indicated the volume of data, known as the digital universe, is doubling in size every two years. And by 2020, there will be as many digital bits as there are stars in the universe. There are many reasons […]
Guest author: Jeff Kelly, Data Strategist, Pivotal The phrase “digital transformation” gets bandied about a lot these days, but what exactly does it mean? When you strip away the hyperbole, I believe digital transformation is the process by which enterprises evolve from using traditional information technology to merely support existing business models to adopting modern […]
Provenance, Lineage & Chain of Custody The models of Provenance, Lineage and Chain of Custody are used in fine art to determine when a piece was created, the sequence of locations where it was held, how it was touched along the way, and who has owned it since creation, all with the purpose of authenticating the piece. […]
The first post in this three part series on Digital Foundations introduces the concept of Customer 360 or Single View of Customer (SVC). We will discuss the need for & the definition of the SVC as part of the first step in any Digital Transformation endeavor. We will also discuss specific benefits from both a […]
People often think about cloud architecture in simplistic terms: you’re either public, private, or hybrid. (In fact, there’s even confusion about the meaning of the term “hybrid” itself—this video helps clear it up: In the real world, of course, virtually every implementation is hybrid—no company puts 100% of its IT environment into one single cloud. […]
The 100% open source and community driven innovation of Apache Hive 2.0 and LLAP (Long Last and Process) truly brings agile analytics to the next level. It enables customers to perform sub-second interactive queries without the need for additional SQL-based analytical tools, enabling rapid analytical iterations and providing significant time-to-value. TRY HIVE LLAP TODAY Read about […]
Apache Hive(™) is the most complete SQL on Hadoop system, supporting comprehensive SQL, a sophisticated cost-based optimizer, ACID transactions and fine-grained dynamic security. Though Hive has proven itself on multi-petabyte datasets spanning thousands of nodes many interesting use cases demand more interactive performance on smaller datasets, requiring a shift to in-memory. Hive 2 marks the […]
The Financial regulators are driving a Data Evolution Traditionally technology moves fast, regulators react slow. When technology leaps forward, it enables financial firms to change the nature of their business – often into un-regulated territory; Regulators react to pass regulation to catch up. This model can work in slow moving markets, but in todays interconnected […]
One the most enjoyable parts of my job is working with customers and partners who have innovated on the Hortonworks Connected Data Platform. Companies like Servient. Here’s a great real example of a recent use case for a customer we worked together on in the energy vertical. I’ve removed the actual name for obvious reasons. […]
Hadoop’s ability to work with Amazon S3 storage goes back to 2006 and the issue HADOOP-574, “FileSystem implementation for Amazon S3”. This filesystem client, “s3://” implemented an inode-style filesystem atop S3: it could support bigger files than S3 could then support, some its operations (directory rename and delete) were fast. The s3 filesystem allowed Hadoop […]
This is the first of a three part series of the evolution of the Hortonworks and Microsoft relationship. Microsoft has led one tech industry revolution after another from the dawn of personal computing to the cloud. Hortonworks is defining a new generation of innovation and impact with its pioneering work in Big Data. You already […]
Hortonworks Big-Data Maturity Scorecard v2.0 The fourth Industrial revolution is here, and competing to succeed in the 4.0 ‘digital’ world entails making the right decisions based on data driven pointers, to successfully implement your strategy. As we work with the entire stack of Fortune 100 organizations, we often see companies—particularly those operating across business lines […]
Cloud Computing is one of the big three trends impacting IT architectures today. What some may not realize is that an underlying connected data architecture is not only essential for cloud, but sits at the confluence of all three trends. Here’s why. The first big trend is IoT. According to BI Intelligence, we can now […]
The Hadoop community is gathering this week to hear from data scientists, innovators and thought leaders on the state of the data industry. A wide range of topics will be covered, ranging from Hadoop use cases to data visualization and user experience. Customers looking for comprehensive solutions to manage all of their data needs rely […]
In the US fast food industry, this is a common question when you order a burger. ‘You want fries with that?’ It’s in the American psyche at this point, and has become common parlance. I was recently heard this exchange: ‘Hey, can I get a copy of your targeted promos report?’ ‘Sure! You want […]
How Hortonworks can help hotel industry capture value through Insights Aggregation and Predictive Analytics Big Data has transformed every industry including the hospitality vertical. Through customer analytics, targeted segmentation, and campaigning, hotels would like to focus on delivering personalized promotions, cross and up-selling travel services. Our objective is to address these challenges through an open-source […]
My life as part of a high performance team Last week we released Hortonworks DataFlow HDF 2.0. It was a great 1 year anniversary present for me – a new release of the product I’ve been supporting since I joined Hortonworks a year ago. I’ve had the privilege of working with the most talented, quick-thinking, […]
Hortonworks DataFlow (HDF) 2.0 is now available! HDF is powered by Apache NiFi 1.0.0, which recently underwent a major redesign. Whether you’re a current user or just now planning to try it out, this is exciting news. A lot of new feature content went into this release such as multi-tenancy and zero-master clustering. The purpose […]
Last week I had a unique opportunity to present to a group of C-level retail industry leaders. Here are five stories I heard that you might find interesting. These are leaders in Merchandising, Marketing, Infrastructure and IT in top European companies. The common link was dinner and retail. I spoke briefly about my experience in retail and adoption of […]
Hadoop for Health Insurance: The future of the healthcare industry rests in the promise of collecting, analyzing and taking action on the output of larger amounts of information. Through the advancements in big data, machine learning and advanced analytics, healthcare organizations can leverage and manipulate data to improve overall member health, reduce costs, improve quality […]
This guest blog post is from our partner Attunity who is a leading provider of big data management software solutions that enable access, management, sharing and distribution of Big Data, across heterogeneous enterprise platforms, organizations, and the cloud. Carole Gunst, Director of Marketing at Attunity, outlines our partnership and three opportunities to help customers learn […]
As enterprises around the world bring more of their sensitive data into Hadoop data lakes, balancing the need for democratization of access to data without sacrificing strong security principles becomes paramount. According to a recent research report by Securosis, “Hadoop has (mostly) reached security parity with the relational platforms of old, and that’s saying a […]
A lot has been said about Data Lakes over the past five years. The call to action from our industry to customers was to take all your data-at-rest in databases and warehouses, and add to this to the data-in-motion from everything in your ecosystem. Then store all of the resulting terabytes and petabytes in a […]
Apache, Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Metron and the Hadoop elephant and Apache project logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries.