Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.
Hadoop grew out of the Google File System paper published in October 2003, which was followed by a second Google research paper, "MapReduce: Simplified Data Processing on Large Clusters." Development started in the Apache Nutch project and moved to the new Hadoop subproject in January 2006. The first committer added to the Hadoop project was Owen O’Malley, in March 2006. Hadoop 0.1.0 was released in April 2006, and the platform continues to evolve through the work of the many contributors to the Apache Hadoop project. Hadoop was named after a toy elephant that belonged to the son of one of its founders.
In 2011, Rob Bearden partnered with Yahoo! to establish Hortonworks with 24 engineers from the original Hadoop team including founders Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O’Malley, Sanjay Radia, and Suresh Srinivas.
Organizations use Hadoop for its ability to store, manage, and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly, and at low cost.
The Hadoop Distributed File System (HDFS) provides scalable, fault-tolerant, cost-efficient storage for your big data lake. It was designed to span large clusters of commodity servers, scaling up to hundreds of petabytes and thousands of servers. Because storage is distributed across many servers, the combined storage resource can grow linearly with demand while remaining economical at every scale.
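As a rough illustration of how storage can be spread across many servers, the sketch below splits a file into fixed-size blocks and assigns each block's replicas to different nodes. The 128 MB block size and replication factor of 3 mirror common HDFS defaults (`dfs.blocksize`, `dfs.replication`). The function names, node names, and round-robin placement are assumptions made for illustration only; real HDFS uses a rack-aware placement policy.

```python
BLOCK_SIZE_MB = 128   # common HDFS default block size (dfs.blocksize)
REPLICATION = 3       # common HDFS default replication factor (dfs.replication)


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Hypothetical round-robin placement (not the real HDFS policy).

    Each block is assigned `replication` distinct nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Raw cluster capacity consumed once every block is replicated."""
    return file_size_mb * replication


if __name__ == "__main__":
    blocks = split_into_blocks(300)
    print(blocks)
    print(place_replicas(len(blocks), ["node1", "node2", "node3", "node4"]))
    print(raw_storage_mb(300))
```

Under these assumptions, adding nodes to the list increases capacity without changing how any single file is split, which is the intuition behind storage growing linearly with demand.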
MapReduce is the original framework for writing massively parallel applications that process large amounts of structured and unstructured data stored in HDFS. MapReduce can take advantage of the locality of data, processing it near the place it is stored on each node in the cluster in order to reduce the distance over which it must be transmitted.
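The map, shuffle, and reduce steps can be sketched in a few lines of plain Python. This is a single-process word count meant only to show the shape of the programming model; it does not use the Hadoop API, and the function names are hypothetical. In a real cluster, the map step would run on the nodes that hold each block of input, and the framework would handle the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

Because each mapper only needs its own lines and each reducer only needs the values for its own keys, both phases can run in parallel across many machines.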
More recently, Apache Hadoop YARN opened Hadoop to other data processing engines, such as Apache Spark, that can now run alongside existing MapReduce jobs to process data in many different ways at the same time. YARN provides the centralized resource management that enables you to process multiple workloads simultaneously. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce’s ability to scale to petabytes of data.
Applications can interact with the data in Hadoop using batch or interactive SQL (Apache Hive) or low-latency access with NoSQL (Apache HBase). Hive allows business users and data analysts to use their preferred business analytics, reporting and visualization tools with Hadoop. Data stored in HDFS in Hadoop can be searched using Apache Solr.
The Hadoop ecosystem extends data access and processing with tools for data governance and integration, including centralized security administration (Apache Ranger) and data classification tagging (Apache Atlas). Combined, these enable dynamic data access policies that prevent data access violations before they occur. Perimeter security (Apache Knox) is also available to integrate with existing enterprise security systems and control user access to Hadoop.
A very common request from customers is to be able to index text in image files; for example, text in scanned PNG files. In this tutorial we are going to walk through how to do this with Solr. Prerequisites: download the Hortonworks Sandbox and complete the Learning the Ropes of the HDP Sandbox tutorial. Step-by-step guide […]
Introduction: R is a popular tool for statistics and data analysis. It has rich visualization capabilities and a large collection of libraries that have been developed and maintained by the R developer community. One drawback to R is that it’s designed to run on in-memory data, which makes it unsuitable for large datasets. Spark is […]
Introduction: Hadoop has always been associated with big data, yet it is often perceived as suitable only for high-latency, high-throughput queries. With contributions from the community, you can now use Hadoop interactively for data exploration and visualization. In this tutorial you’ll learn how to analyze large datasets using Apache Hive LLAP on Amazon Web Services […]
This tutorial will cover the core concepts of Storm and the role it plays in an environment where real-time, low-latency and distributed data processing is important.
Introduction: Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow developers to execute a variety of data intensive workloads. In this tutorial, we will use an Apache Zeppelin notebook for our development environment to keep things simple and elegant. Zeppelin will […]
In this tutorial we will explore how you can use policies in HDP Advanced Security to protect your enterprise data lake and audit access by users to resources on HDFS, Hive and HBase from a centralized HDP Security Administration Console.
Introduction: Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, accounting and data protection. Apache Ranger already extends baseline features for coordinated enforcement across Hadoop workloads, from batch and interactive SQL to real-time. In this tutorial, […]
With today’s new rapid pace, speed to market is a huge factor for any business. The faster a company can gain insights from their data, the better they can serve their customers. If changes aren’t made quickly enough, there’s a significant risk of losing customers and market share. One example of gaining faster insights from […]
Before making important business decisions, it’s crucial for a company to see a complete picture of what’s going on. To do this, they need to gather the necessary data to make the most informed decision possible. One example of a company that is using data to get a complete picture of its business is DHISCO. […]
Last Friday, I wrote about how TMW Systems leverages Hortonworks to crack the last mile problem. TMW, the leading transportation technology provider, consolidates the data of their carrier customers to deliver fresh analytical insights to these small carriers, thereby giving them the legs to walk the last mile. TMW’s use case also highlights the importance […]
The critical business challenge for healthcare organizations is to effectively manage their data. Success means access to real-time market data, data visualization, and cost-saving opportunities. Data virtualization and predictive analytics further improve both the business side of healthcare organizations, and patient care. At San Jose DataWorks Summit (June 13-15), Vizient will show how predictive analytics helps connect members […]
Yesterday we announced that Mitsubishi Fuso Truck and Bus Corporation has deployed Microsoft Azure HDInsight, powered by Hortonworks Data Platform (HDP®), in the public cloud to power the company’s connected data architecture. Notably, “Mitsubishi Fuso’s big data strategy began in 2014 and since then the company has undergone a process to modernize all operations. […]
Danske Bank, headquartered in Copenhagen, is the largest bank in Denmark. It’s also one of the major retail banks in the northern European region, with over 5 million retail customers. Data is mission critical to Danske Bank as it provides them with actionable intelligence to help minimize risk and maximize opportunities. In our latest video, […]
With the San Jose DataWorks Summit (June 13-15) just two months away, we’re busy finalizing the lineup of an impressive array of speakers and business use cases. This year our Enterprise Adoption Track will feature Jay Etchings, Director of Operations for Research Computing at Arizona State University. In February we announced Jay’s new book, “Strategies in Biomedical Data […]
Hortonworks continues to expand its list of customers in the Asia Pacific region, as well as in the housing and building industry. We recently completed a case study to showcase how LIXIL Corporation uses HDP to be first in manufacturing for the Japanese Smart Home Market. LIXIL is a […]
As we kick off the new year I wanted to thank our customers, partners, Apache community members, and of course the amazing Hortonworks team, for an amazing 2016. Let’s take a step back and look at some of the Hortonworks highlights from last year… IN THE ECOSYSTEM there was tremendous acceleration. At the beginning of […]
“Banking as a service has long sat at the heart of our economy. In our digitally enabled world, the need to seamlessly and efficiently connect different economic agents who are buying and selling goods and services, is critical. The Open Banking Standard is a framework for making banking data work better: for customers; for businesses […]
It’s no secret that there is a data explosion. A recent IDC analyst report from April 2014 indicated the volume of data, known as the digital universe, is doubling in size every two years. And by 2020, there will be as many digital bits as there are stars in the universe. There are many reasons […]
Guest author: Jeff Kelly, Data Strategist, Pivotal. The phrase “digital transformation” gets bandied about a lot these days, but what exactly does it mean? When you strip away the hyperbole, I believe digital transformation is the process by which enterprises evolve from using traditional information technology to merely support existing business models to adopting modern […]
People often think about cloud architecture in simplistic terms: you’re either public, private, or hybrid. (In fact, there’s even confusion about the meaning of the term “hybrid” itself.) In the real world, of course, virtually every implementation is hybrid: no company puts 100% of its IT environment into one single cloud. […]
The 100% open source and community driven innovation of Apache Hive 2.0 and LLAP (Live Long and Process) truly brings agile analytics to the next level. It enables customers to perform sub-second interactive queries without the need for additional SQL-based analytical tools, enabling rapid analytical iterations and providing significant time-to-value. Read about […]
Cloud Computing is one of the big three trends impacting IT architectures today. What some may not realize is that an underlying connected data architecture is not only essential for cloud, but sits at the confluence of all three trends. Here’s why. The first big trend is IoT. According to BI Intelligence, we can now […]
How Hortonworks can help the hotel industry capture value through Insights Aggregation and Predictive Analytics: Big Data has transformed every industry, including the hospitality vertical. Through customer analytics, targeted segmentation, and campaigning, hotels would like to focus on delivering personalized promotions and on cross-selling and up-selling travel services. Our objective is to address these challenges through an open-source […]
My life as part of a high performance team: Last week we released Hortonworks DataFlow HDF 2.0. It was a great one-year anniversary present for me: a new release of the product I’ve been supporting since I joined Hortonworks a year ago. I’ve had the privilege of working with the most talented, quick-thinking, […]
You may have seen our invite to join the genomics consortium. Let me recap what it is about and bring you up to speed on our progress and next steps. Hortonworks is quarterbacking a consortium of leading healthcare organizations and subject matter experts to help develop the platform requirements for next generation […]
We are pleased to announce the latest release of Apache Ambari 2.4, which further simplifies Hadoop operations. With Ambari 2.4 (which is part of the recently released Hortonworks Data Platform 2.5), enterprises can plan, install and securely configure the Hortonworks Data Platform and easily provide ongoing maintenance and management. This new release includes an integrated […]
Recent industry research by both Strategy Meets Action (SMA) and Novarica highlights analytics as the top priority for the insurance industry. Further, the Insurers’ 2016 Strategic Initiatives: Advancing Industry Transformation report by SMA identified customer engagement as another top priority for insurers. Success in the insurance industry depends on your company’s ability to quickly interact […]
Today we announced Microsoft Azure HDInsight as our Premier Connected Data Platforms cloud solution, providing customers Apache™ Hadoop® as a fully managed cloud service. The announcement is very timely as this week, Hortonworks and Microsoft are celebrating the 10th anniversary of Hadoop at the Hadoop Summit 2016. Highlighting HDInsight as our premier cloud offering helps […]
This is part two of a two-part series from Hadoop Summit. In his post, Rob Bearden talks about how data transforms everything and the need for Connected Data Platforms. As a follow-on, here are four predictions for the technologies behind this transformation. #1: Intelligent Self-Configuring Networks Will Enable New & Faster Delivery of Data and Analytics Across Data […]
June has got off to a great start – and not only because it seems like summer has arrived in London! Yesterday, our team gathered in our International HQ a stone’s throw from Liverpool Street Station for a session with Mike Schiebel, our cyber security strategist, who is visiting from the west coast. We have […]
According to Gartner Research, by 2020 the total number of connected cars will be nine times more than that of 2015. Additionally, 80% of all new vehicles will have data connectivity, 30% of connected-vehicles will have built-in, over-the-air software capabilities, and over one billion connected automotive subsystems will be shipped. With the exponential growth of […]
The world’s top authorities on Apache Hadoop convene at Hadoop Summit San Jose, and one of the top questions to be answered there concerns the future and direction of Hadoop. Sanjay Radia, Founder and Architect at Hortonworks, led the track, which selected 13 sessions around this topic. I asked Sanjay what he hoped would […]
Apache, Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Phoenix, NiFi, HAWQ, Zeppelin, Slider, Mahout, MapReduce, HDFS, YARN, Metron and the Hadoop elephant and Apache project logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries.