Apache Hadoop is an open-source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide data storage, data processing, data access, data governance, security, and operations.
Hadoop grew out of the Google File System paper published in October 2003. That paper was followed by another research paper from Google, MapReduce: Simplified Data Processing on Large Clusters. Development started in the Apache Nutch project but moved to the new Hadoop subproject in January 2006. The first committer added to the Hadoop project was Owen O’Malley in March 2006. Hadoop 0.1.0 was released in April 2006, and the platform continues to evolve through the work of the many contributors to the Apache Hadoop project. Hadoop was named after a toy elephant that belonged to the son of one of its founders.
In 2011, Rob Bearden partnered with Yahoo! to establish Hortonworks with 24 engineers from the original Hadoop team including founders Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O’Malley, Sanjay Radia, and Suresh Srinivas.
One of the main reasons organizations use Hadoop is its ability to store, manage, and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly, and at low cost.
The Hadoop Distributed File System (HDFS) provides scalable, fault-tolerant, cost-efficient storage for your big data lake. It was designed to span large clusters of commodity servers, scaling up to hundreds of petabytes and thousands of servers. By distributing storage across many servers, the combined storage resource can grow linearly with demand while remaining economical at any scale.
MapReduce is the original framework for writing massively parallel applications that process large amounts of structured and unstructured data stored in HDFS. MapReduce takes advantage of data locality: it processes data on or near the node where it is stored, which reduces the amount of data that must be moved across the cluster network.
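To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the example. It is plain Python and does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster the framework would run the map and reduce functions in parallel on many nodes and perform the shuffle itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework in real MapReduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over an iterable of text records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

Because each call to map_phase depends only on its own record, and each call to reduce_phase only on its own key, both steps can be spread across machines without coordination, which is what lets the model scale.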
More recently, Apache Hadoop YARN opened Hadoop to other data processing engines, such as Apache Spark, that can now run alongside existing MapReduce jobs to process data in many different ways at the same time. YARN provides the centralized resource management that enables you to process multiple workloads simultaneously. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez builds on the MapReduce paradigm, dramatically improving its speed while maintaining MapReduce’s ability to scale to petabytes of data.
Applications can interact with the data in Hadoop using batch or interactive SQL (Apache Hive) or low-latency access with NoSQL (Apache HBase). Hive allows business users and data analysts to use their preferred business analytics, reporting and visualization tools with Hadoop. Data stored in HDFS in Hadoop can be searched using Apache Solr.
The Hadoop ecosystem extends data access and processing with tools for data governance and integration, including centralized security administration (Apache Ranger) and data classification tagging (Apache Atlas). Combined, these enable dynamic data access policies that prevent data access violations before they occur. Perimeter security (Apache Knox) is also available to integrate with existing enterprise security systems and control user access to Hadoop.
Introduction R is a popular tool for statistics and data analysis. It has rich visualization capabilities and a large collection of libraries that have been developed and maintained by the R developer community. One drawback to R is that it’s designed to run on in-memory data, which makes it unsuitable for large datasets. Spark is […]
A very common request from many customers is to be able to index text in image files; for example, text in scanned PNG files. In this tutorial we are going to walk through how to do this with Solr. Prerequisites: download the Hortonworks Sandbox and complete the Learning the Ropes of the HDP Sandbox tutorial. Step-by-step guide […]
Introduction Hadoop has always been associated with Big Data, yet the perception is that it’s only suitable for high-latency, high-throughput queries. With the contribution of the community, you can use Hadoop interactively for data exploration and visualization. In this tutorial you’ll learn how to analyze large datasets using Apache Hive LLAP on Amazon Web Services […]
Introduction The Azure cloud infrastructure has become a common place for users to deploy virtual machines on the cloud due to its flexibility, ease of deployment, and cost benefits. Microsoft has expanded Azure to include a marketplace with thousands of certified, open source, and community software applications and developer services, pre-configured for Microsoft Azure. This […]
Introduction Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow developers to execute a variety of data intensive workloads. In this tutorial, we will use an Apache Zeppelin notebook for our development environment to keep things simple and elegant. Zeppelin will […]
In this tutorial we will explore how you can use policies in HDP Advanced Security to protect your enterprise data lake and audit access by users to resources on HDFS, Hive and HBase from a centralized HDP Security Administration Console.
This tutorial will cover the core concepts of Storm and the role it plays in an environment where real-time, low-latency and distributed data processing is important.
Last month, at our DataWorks Summit in San Jose, we were treated to a host of keynote speakers from a variety of industries, all leveraging the power of Big Data and Hortonworks to usher in groundbreaking, and even life-saving, transformation to their respective industries. One of the speakers was Dr. Wade Schulz, Resident Physician and […]
Tuesday night I had the opportunity to visit Hacker Lab, where the Sacramento Women in Data Science hosted Dr. Ian Brooks, one of our super-talented Solutions Engineers. SWDS has over 500 members and regularly hosts data science events to help grow the data science and analytics community for, but not exclusive to, women. Groups such as […]
We’ve just published our most recent customer case study! This one comes straight from Japan and is emblematic of the trend towards connected data architectures throughout the North Asian region, and Hortonworks’ continued growth in the same. Mitsubishi Fuso Truck and Bus Corporation (Mitsubishi Fuso) is a leading manufacturer of trucks, buses, and industrial […]
Did you catch the DataWorks Summit last week? Wedged between our huge announcements and the fumes from the t-shirt press, the Summit featured an array of keynote speakers from top companies across many industries. One of the keynote presenters was Gowri Selka, Head of Data Analytics and Corporate Technology at Walgreens Boots Alliance. Gowri spoke about […]
Klarna is a leading e-payment platform in Europe. The company provides payment services for online storefronts and handles large amounts of data every minute. This data enables them to roll out new products and fine-tune existing products to ensure seamless buying experiences for its customers. Today it was reported that this Swedish payments startup, valued […]
The path to a successful big data implementation isn’t straightforward. There are many decisions and considerations along the way – from technology, to people to process – all these need to come together for a successful outcome. Hortonworks is in the business of helping enterprises achieve their desired business outcomes with big data as effectively […]
Customer service as a car insurance provider comes with the added challenge of constantly ingesting and storing data of all types. Not only that, but a large part of marketing to new customers, while remaining committed to existing customers, is effectively utilizing data from multiple sources. Progressive Insurance is one of the largest U.S. auto […]
The critical business challenge for healthcare organizations is to effectively manage their data. Success means access to real-time market data, data visualization, and cost-saving opportunities. Data virtualization and predictive analytics further improve both the business side of healthcare organizations, and patient care. At San Jose DataWorks Summit (June 13-15), Vizient will show how predictive analytics helps connect members […]
The decision-making process for a customer to buy products in the retail space can range from days to seconds. The spontaneous buying patterns among consumers create a business challenge for retailers to address their data needs just as quickly; otherwise customers will go elsewhere. When you combine a full pharmacy to the needs of a […]
Thank you for reading our Data Lake 3.0 series! We are encouraged by the positive responses to our blogs (part 1, part 2, part 3, part 4, part 5). In Data Lake 3.0, we are envisioning a large data lake shared between multiple tenants and dockerized applications (ranging from real-time to batch). In a shared […]
The San Jose DataWorks Summit (June 13-15) is next week! We have an impressive lineup of keynote and breakout speakers. This year our Enterprise Adoption Track will feature Chris Dingle, Sr. Director, Customer Intelligence, at Rogers Communications. Rogers Communications is one of the largest Canadian communications and media companies. Headquartered in Toronto, Rogers is data-driven and […]
In January we announced the Hortonworks Data Heroes initiative. It’s our way of recognizing the Data Visionaries, Data Scientists, and Data Architects transforming their businesses and organizations through Big Data. Hortonworks has over 1000 customers ranging across every industry. There are so many Big Data stories to be told: stories about transformation, cost reduction, and […]
Apache Spark 2.1 Improves Structured Streaming and Machine Learning. Structured Streaming: Kafka 0.10 support, metrics and stability improvements. Machine Learning: SparkR improvements including new ML algorithms for LDA, random forests, GMM, etc. The recent release of Hortonworks Data Platform 2.6 (“HDP 2.6”) includes Apache Spark 2.1. And Hortonworks Data Cloud (“HDCloud”) for AWS gives […]
Yesterday we announced that Mitsubishi Fuso Truck and Bus Corporation has deployed Microsoft Azure HDInsight, powered by Hortonworks Data Platform (HDP ®), in the public cloud to power the company’s connected data architecture. Notably, “Mitsubishi Fuso’s big data strategy began in 2014 and since then the company has undergone a process to modernize all operations. […]
The San Jose DataWorks Summit (June 13-15) is just a few weeks away! We’re busy finalizing the lineup of an impressive array of speakers and business use cases. This year our Data Processing & Warehouse Track will feature Daniel Sumners, IT Architect at CenterPoint Energy. CenterPoint Energy is a Fortune 500 electric and gas utility company operating in several […]
Simon Meredith, Chief Technology Officer – CSI, IBM Europe explains the significance of IBM & Hortonworks working together in the era of Big Data What is fuelling IBM’s commitment to Apache Hadoop and Spark? The pressures of day to day business are delaying companies doing more with their data. IBM’s commitment is to initiate, simplify […]
Danske Bank, headquartered in Copenhagen, is the largest bank in Denmark. It’s also one of the major retail banks in the northern European region, with over 5 million retail customers. Data is mission critical to Danske Bank as it provides them with actionable intelligence to help minimize risk and maximize opportunities. In our latest video, […]
Destination Autonomous The march towards autonomous vehicles continues to accelerate. While expert opinion differs on the specific timing and use cases that will emerge first, few deny that self-driving cars are in our future. Not surprisingly, when reviewing Big Data strategies with my automotive clients, discussions on data management strategies for autonomous driving research inevitably […]
Clearsense, based in Jacksonville, Florida, develops cloud-based applications based upon Hortonworks 100% open-source Connected Data Platforms. Its customers are hospitals and health systems, and its mission is to save people’s lives by giving providers and medical practitioners advanced notice of a patient’s deteriorating health. Clearsense’s flagship product, Inception, is “designed specifically for the needs of […]
Thank you for reading our Data Lake 3.0 series! In part 1 of the series, we introduced what a Data Lake 3.0 is. In part 2 of the series, we talked about how a multi-colored YARN will play a critical role in building a successful Data Lake 3.0. In part 3 of the series, […]
Carolinas HealthCare System is one of the leading healthcare organizations in the Southeast and one of the most comprehensive, not-for-profit systems in the country. Our more than 900 care locations include: Academic medical centers Hospitals Freestanding emergency departments Healthcare pavilions Physician practices Outpatient surgical centers Laboratories Rehabilitation centers Home health agencies Nursing homes Hospice and […]
With the San Jose DataWorks Summit (June 13-15) just two months away, we’re busy finalizing the lineup of an impressive array of speakers and business use cases. This year our Enterprise Adoption Track will feature Jay Etchings, Director of Operations for Research Computing at Arizona State University. In February we announced Jay’s new book, “Strategies in Biomedical Data […]
Apache Spark is a powerful framework for data processing and analysis. Spark provides two modes for data exploration: interactive, provided by the spark-shell, pySpark, and SparkR REPLs; and batch, using spark-submit to submit a Spark application to the cluster without interaction in the middle of run-time. While these two modes look different on the surface, deep down they […]
You have heard about Big Data for a long time, and how companies that use Big Data as part of their business decision making process experience significantly higher profitability than their competition. Now that your company is ready to embark on its first Apache Hadoop® journey there are important lessons to be learned. Read on […]
Hive View 2.0 is New in Apache Ambari 2.5 Ambari’s Hive View gives analysts and DBAs a convenient web interface to Apache Hive which allows SQL analytics, data management and performance diagnostics. Ambari 2.5 introduces Hive View 2.0 with a brand new user experience plus a slew of great new tools to help DBAs run […]