Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.
Hadoop grew out of the Google File System paper, published in October 2003, which was followed by a second research paper from Google, MapReduce: Simplified Data Processing on Large Clusters. Development started in the Apache Nutch project but moved to the new Hadoop subproject in January 2006. The first committer added to the Hadoop project was Owen O’Malley, in March 2006. Hadoop 0.1.0 was released in April 2006, and the platform continues to evolve through the work of the many contributors to the Apache Hadoop project. Hadoop was named after a toy elephant belonging to the son of one of its founders.
In 2011, Rob Bearden partnered with Yahoo! to establish Hortonworks with 24 engineers from the original Hadoop team including founders Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O’Malley, Sanjay Radia, and Suresh Srinivas.
Organizations use Hadoop for its ability to store, manage, and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly, and at low cost.
The Hadoop Distributed File System (HDFS) provides scalable, fault-tolerant, cost-efficient storage for your big data lake. It was designed to span large clusters of commodity servers scaling up to hundreds of petabytes and thousands of servers. By distributing storage across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.
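To make the idea of spreading storage across many servers concrete, here is a minimal sketch of how a file might be cut into fixed-size blocks with each block copied to several nodes. The 128 MB block size and replication factor of 3 match the usual HDFS defaults, but the function name `plan_blocks`, the node names, and the simple round-robin placement are assumptions made for illustration; real HDFS placement is rack-aware and considerably more involved.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual HDFS default block size
REPLICATION = 3                 # the usual HDFS default replication factor


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return, for each block of the file, the list of nodes holding a replica.

    Placement here is plain round-robin, a simplification assumed for
    illustration only; it is not the HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for block in range(n_blocks):
        # Each replica of a block goes to a different node.
        replicas = [nodes[(block + r) % len(nodes)] for r in range(replication)]
        plan.append(replicas)
    return plan


if __name__ == "__main__":
    # A 300 MB file on a hypothetical four-node cluster.
    for index, replicas in enumerate(plan_blocks(300 * 1024 * 1024, ["n1", "n2", "n3", "n4"])):
        print(index, replicas)
```

Because every block lives on more than one node, losing a single server does not lose data, and adding servers adds capacity.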
MapReduce is the original framework for writing massively parallel applications that process large amounts of structured and unstructured data stored in HDFS. MapReduce can take advantage of the locality of data, processing it near the place it is stored on each node in the cluster in order to reduce the distance over which it must be transmitted.
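The MapReduce model can be illustrated with a small word count, sketched here in plain Python rather than with the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and everything runs in one process; on a real cluster the map tasks would run in parallel on the nodes that hold the data, and the framework would perform the shuffle.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce over an iterable of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

Each map call looks at only one record and each reduce call at only one key, which is what lets the work be split across many machines.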
More recently, Apache Hadoop YARN opened Hadoop to other data processing engines, such as Apache Spark, that can now run alongside existing MapReduce jobs to process data in many different ways at the same time. YARN provides the centralized resource management that enables you to process multiple workloads simultaneously. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce’s ability to scale to petabytes of data.
Applications can interact with the data in Hadoop using batch or interactive SQL (Apache Hive) or low-latency access with NoSQL (Apache HBase). Hive allows business users and data analysts to use their preferred business analytics, reporting and visualization tools with Hadoop. Data stored in HDFS in Hadoop can be searched using Apache Solr.
The Hadoop ecosystem extends data access and processing with powerful tools for data governance and integration, including centralized security administration (Apache Ranger) and data classification tagging (Apache Atlas). Together, these enable dynamic data access policies that proactively prevent data access violations. Hadoop perimeter security (Apache Knox) is also available to integrate with existing enterprise security systems and control user access to Hadoop.
Introduction This is the third tutorial in a series about building and deploying machine learning models with Apache NiFi and Spark. In Part 1 of the series we learned how to use NiFi to ingest and store Twitter streams. In Part 2 we ran Spark from a Zeppelin notebook to design a machine learning model […]
Introduction This tutorial will teach you how to build sentiment analysis algorithms with Apache Spark. We will be doing data transformation using Scala and Apache Spark 2, and we will be classifying tweets as happy or sad using a Gradient Boosting algorithm. Although this tutorial is focused on sentiment analysis, Gradient Boosting is a versatile […]
Introduction This tutorial will teach you how to set up a full development environment for developing and debugging Spark applications. For this tutorial we’ll be using Java, but Spark also supports development with Scala, Python, and R. The Scala version of this tutorial can be found here, and the Python version here. We’ll be using […]
Introduction This tutorial will teach you how to set up a full development environment for developing and debugging Spark applications. For this tutorial we’ll be using Python, but Spark also supports development with Java, Scala, and R. The Scala version of this tutorial can be found here, and the Java version here. We’ll be using […]
Introduction This tutorial will teach you how to set up a full development environment for developing and debugging Spark applications. For this tutorial we’ll be using Scala, but Spark also supports development with Java, Python, and R. The Java version of this tutorial can be found here, and the Python version here. We’ll be using […]
A very common request from customers is to be able to index text in image files; for example, text in scanned PNG files. In this tutorial we are going to walk through how to do this with Solr. Prerequisites: Download the Hortonworks Sandbox. Complete the Learning the Ropes of the HDP Sandbox tutorial. Step-by-step guide […]
Introduction R is a popular tool for statistics and data analysis. It has rich visualization capabilities and a large collection of libraries that have been developed and maintained by the R developer community. One drawback to R is that it’s designed to run on in-memory data, which makes it unsuitable for large datasets. Spark is […]
Introduction The Azure cloud infrastructure has become a common place for users to deploy virtual machines on the cloud due to its flexibility, ease of deployment, and cost benefits. Microsoft has expanded Azure to include a marketplace with thousands of certified, open source, and community software applications and developer services, pre-configured for Microsoft Azure. This […]
Apache Spark is a powerful framework for data processing and analysis. Spark provides two modes for data exploration: interactive, provided by the spark-shell, pySpark, and SparkR REPLs; and batch, using spark-submit to submit a Spark application to a cluster without interaction during the run. While these two modes look different on the surface, deep down they […]
You have heard about Big Data for a long time, and how companies that use Big Data as part of their business decision making process experience significantly higher profitability than their competition. Now that your company is ready to embark on its first Apache Hadoop® journey there are important lessons to be learned. Read on […]
Hive View 2.0 is New in Apache Ambari 2.5 Ambari’s Hive View gives analysts and DBAs a convenient web interface to Apache Hive which allows SQL analytics, data management and performance diagnostics. Ambari 2.5 introduces Hive View 2.0 with a brand new user experience plus a slew of great new tools to help DBAs run […]
Danske Bank, headquartered in Copenhagen, is the largest bank in Denmark. It’s also one of the major retail banks in the northern European region, with over 5 million retail customers. Danske Bank is leveraging Hortonworks for actionable intelligence to help minimize risk and maximize opportunities. Three weeks ago, at the DataWorks Summit in Munich, we announced […]
Three weeks ago, at the DataWorks Summit in Munich, we announced the Data Hero winners for the EMEA region. The winner in the Data Visionary category was Daljit Rehal, Global Director, Digital & Data Services at Centrica. You can read the announcement here. Centrica supplies energy and energy-related services to around 28 million customer accounts in […]
Andrew Ng, the renowned data scientist, has said that artificial intelligence (AI) needs to be a company-wide strategic decision. Companies that don’t strategically invest in AI will slowly lose market share to companies whose core businesses are built around AI. AI enables the prediction, planning and automation of a variety of tasks, and for enterprises, […]
R is one of the primary programming languages for data science, with more than 10,000 packages. R is open source software that is widely taught in colleges and universities as part of statistics and computer science curricula. R uses the data frame as its API, which makes data manipulation convenient. R has powerful visualization infrastructure, […]
Apache Hadoop has always been associated with storing and processing vast amounts of data. But did you know it’s also an awesome engine to power interactive data exploration and visualization? With the development of Apache Hive LLAP (a recent innovation included in the Hortonworks Data Platform), you can use Hadoop with Business Intelligence tools (like […]
OPEN SOURCE HADOOP NOW RUNS ON AN OPEN COMPUTE PLATFORM The software market is undergoing a major transition, moving away from proprietary software that leads to customer lock-in. Open source software offers freedom, more flexibility, and faster innovation – all at a lower cost. With the release of HDP 2.6 now available on IBM Power Systems, […]
HDP 2.6 takes a huge step forward toward true data management by introducing SQL-standard ACID Merge to Apache Hive. As scalable as Apache Hadoop is, many workloads don’t work well in the Hadoop environment because they need frequent or unpredictable updates. Updates using hand-written Apache Hive or Apache Spark jobs are extremely complex. Not only […]
We are thrilled to announce that Hortonworks Data Platform (HDP) version 2.6 is now available, both on-premises and in the cloud. For the first time, we are also making this available on IBM Power Systems in addition to the x86 chipset. During 2016, we saw many of Hortonworks’ customers deploy more and […]
Human Assisted AI Another common trend is pairing humans with Artificial Intelligence (AI) to evaluate its results. As great and sensational as AI has been made out to be recently, it is still a long way from having human-like abilities of comprehension, reasoning and intuition. For instance, in radiology, given lymph node cells, AI alone had 7.5 percent […]
Large-scale Machine Learning The ability to learn without being explicitly programmed, Machine Learning, has been around for a long time and is well understood. What is different is the relatively recent emergence of general purpose tools, such as Apache Spark, that enable processing of very large datasets. Additionally, data scientists can now collaborate and rapidly […]
One of the best parts about my job is learning how Big Data drives the world around us. I’m continually awed by the plethora of transformative customer stories and Big Data use cases across every industry. Take for instance Soleo Communications. Soleo bridges the space between the world of telephony and the world of Modern […]
Hortonworks continues to expand its list of customers in the Asia Pacific region, as well as in the housing and building industry. We recently completed a case study to showcase how LIXIL Corporation uses HDP to be first in manufacturing for the Japanese Smart Home Market. LIXIL is a […]
Did you know every Hortonworks HDP support subscription comes with SmartSense? Advanced Analytics of Diagnostic Data Prevents Issues SmartSense uses advanced analytics to make suggestions and recommendations based on the deep knowledge of our Hortonworks engineers and committers to prevent issues and improve performance of your HDP cluster. Based on the diagnostic data collected from […]
Thank you for reading our Data Lake 3.0 series! In part 1 of the series, we introduced what a Data Lake 3.0 is. In part 2 of the series, we talked about how a multi-colored YARN will play a critical role in building a successful Data Lake 3.0. In part 3 of the series, we […]
Syncsort and Hortonworks working together to drive the success of a modern EDW solution. The Enterprise Data Warehouse has become a standard component of the corporate data architecture. In the past 15 years, a variety of product offerings were introduced into the market for building EDWs, operational data stores, and real-time data warehouses. The differences is the […]
We are now all accustomed to pundits and observers all over the world boldly proclaiming that data is the currency of the digital age. But if everyone does it then where will my competitive advantage come from? Well, one way could be by being faster, better, and cheaper than the rest. That is how previous […]
Last week I was in Barcelona for Mobile World Congress. As every year, it is a big event that gathers the greatest collection of companies and people in the greater mobile industry. This year was no exception as MWC was the meeting grounds for our industry and an opportunity to see what’s ahead. Although there […]
“Data is the currency for a digital transformation” is a theme that came out loud and clear during last week’s Gartner Data and Analytics Summit. This event for the first time brought together two popular conferences: Analytics and Master Data Management (MDM). The result was a very enjoyable time with tons of great conversations and […]
Thank you for reading our Data Lake 3.0 series! In part 1 of the series, we introduced what a Data Lake 3.0 is and in part 2 of the series, we talked about how a multi-colored YARN will play a critical role in building a successful Data Lake 3.0. In this blog, we will take a […]
Next week (March 6 – 9) Gartner will host their annual Data and Analytics Summit in Grapevine, TX. This is where analysts from Gartner, vendors and many leaders of businesses of all sizes all get together and talk about data and analytics. Personally, I have not attended the conference for the past few years, but […]
The new year brings new innovation and collaborative efforts. Various teams from the Apache community have been working hard for the last eighteen months to bring the EZ button to Apache Hadoop technology and Data Lake. In the coming months, we will publish a series of blogs introducing our Data Lake 3.0 architecture and highlighting […]
Hortonworks has achieved quite a bit of success with online dating. Personally, I haven’t just yet, but hey it warms my heart to think about all those that we’ve helped bring together. Valentine’s Day is upon us and so I wanted to launch this cupid’s arrow with a missive about how Hortonworks Data Platform (HDP) […]
Apache, Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Metron and the Hadoop elephant and Apache project logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries.