Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.
The genesis of Hadoop came from the Google File System paper that was published in October 2003. This paper spawned another research paper from Google – MapReduce: Simplified Data Processing on Large Clusters. Development started in the Apache Nutch project, but was moved to the new Hadoop subproject in January 2006. The first committer added to the Hadoop project was Owen O’Malley in March 2006. Hadoop 0.1.0 was released in April 2006 and continues to be evolved by the many contributors to the Apache Hadoop project. Hadoop was named after co-founder Doug Cutting’s son’s toy elephant.
In 2011, Rob Bearden partnered with Yahoo! to establish Hortonworks with 24 engineers from the original Hadoop team including founders Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O’Malley, Sanjay Radia, and Suresh Srinivas.
Organizations use Hadoop for its ability to store, manage, and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly, and at low cost.
The Hadoop Distributed File System (HDFS) provides scalable, fault-tolerant, cost-efficient storage for your big data lake. It was designed to span large clusters of commodity servers, scaling up to hundreds of petabytes and thousands of servers. By distributing storage across many servers, the combined storage resource can grow linearly with demand while remaining economical at every scale.
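The core idea can be sketched in a few lines of Python: a file is split into fixed-size blocks, and each block is replicated on several distinct servers so the loss of any one server loses no data. The round-robin placement, block size, and node names below are simplified illustrations, not the actual HDFS placement policy:

```python
# Conceptual sketch of HDFS-style block storage (illustrative, not the real
# HDFS placement policy). A file is split into fixed-size blocks; each block
# is replicated on `replication` distinct nodes for fault tolerance.

def place_blocks(file_size, block_size, nodes, replication=3):
    """Return a mapping of block index -> list of nodes holding a replica."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Simplified round-robin placement across the cluster.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

layout = place_blocks(file_size=350, block_size=128,
                      nodes=["node1", "node2", "node3", "node4"])
# 350 bytes at a 128-byte block size -> 3 blocks, each on 3 distinct nodes.
```

With the default replication factor of 3, adding servers grows both capacity and aggregate read bandwidth, which is why the combined resource scales linearly with the cluster.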
MapReduce is the original framework for writing massively parallel applications that process large amounts of structured and unstructured data stored in HDFS. MapReduce can take advantage of data locality, processing data near where it is stored on each node in the cluster to minimize the amount of data sent over the network.
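The map/shuffle/reduce flow can be illustrated with the canonical word-count example. This is a single-process Python sketch of the programming model only; a real MapReduce job would be written against the Hadoop API and run distributed across the cluster:

```python
from collections import defaultdict

# Map phase: each input record (a line of text) emits (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle phase: group all values by key (the Hadoop framework does this
# between the map and reduce phases).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine the grouped values for each key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "The fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts["the"] == 3, counts["fox"] == 2
```

Because the map function runs independently on each record, Hadoop can execute it on whichever node already holds that block of data, which is the data-locality advantage described above.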
More recently, Apache Hadoop YARN opened Hadoop to other data processing engines, such as Apache Spark, which can now run alongside existing MapReduce jobs to process data in many different ways at the same time. YARN provides the centralized resource management that enables you to process multiple workloads simultaneously. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez generalizes the MapReduce paradigm, dramatically improving its speed while maintaining MapReduce’s ability to scale to petabytes of data.
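The key generalization is that Tez lets an application express its whole workflow as a single directed acyclic graph (DAG) of tasks, rather than a chain of separate MapReduce jobs with intermediate HDFS writes between them. A minimal sketch of that scheduling idea, using hypothetical vertex names and a topological ordering (Kahn's algorithm) in Python:

```python
from collections import deque

# Hypothetical workflow DAG: each vertex lists the vertices it depends on.
dag = {
    "read_orders":  [],
    "read_users":   [],
    "join":         ["read_orders", "read_users"],
    "aggregate":    ["join"],
    "write_report": ["aggregate"],
}

def schedule(dag):
    """Return one valid execution order: a vertex runs only after its inputs."""
    indegree = {v: len(deps) for v, deps in dag.items()}
    dependents = {v: [] for v in dag}
    for v, deps in dag.items():
        for d in deps:
            dependents[d].append(v)
    ready = deque(sorted(v for v, n in indegree.items() if n == 0))
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in dependents[v]:       # a completed vertex unblocks its dependents
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return order

print(schedule(dag))
```

Expressing the pipeline as one DAG lets the engine stream data directly between vertices, whereas the same workflow as chained MapReduce jobs would materialize every intermediate result to HDFS.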
Applications can interact with the data in Hadoop using batch or interactive SQL (Apache Hive) or low-latency access with NoSQL (Apache HBase). Hive allows business users and data analysts to use their preferred business analytics, reporting and visualization tools with Hadoop. Data stored in HDFS can be searched using Apache Solr.
The Hadoop ecosystem extends data access and processing with powerful tools for data governance and integration including centralized security administration (Apache Ranger) and data classification tagging (Apache Atlas), which combined enable dynamic data access policies that proactively prevent data access violations from occurring. Hadoop perimeter security is also available to integrate with existing enterprise security systems and control user access to Hadoop (Apache Knox).
A very common request from many customers is to be able to index text in image files; for example, text in scanned PNG files. In this tutorial we are going to walk through how to do this with Solr. Prerequisites: download the Hortonworks Sandbox and complete the Learning the Ropes of the HDP Sandbox tutorial. Step-by-step guide […]
Introduction When you deploy virtual machines on Azure, a good practice is to set up Azure Network Security Groups (NSG) to minimize the exposure of endpoints and limit access to those endpoints to only known IPs from the Internet. In order to access the rest of the endpoints in your Virtual Network (VNet) on Azure, […]
Introduction The Azure cloud infrastructure has become a common place for users to deploy virtual machines on the cloud due to its flexibility, ease of deployment, and cost benefits. Microsoft has expanded Azure to include a marketplace with thousands of certified, open source, and community software applications and developer services, pre-configured for Microsoft Azure. This […]
In this tutorial we will explore how you can use policies in HDP Advanced Security to protect your enterprise data lake and audit access by users to resources on HDFS, Hive and HBase from a centralized HDP Security Administration Console.
Introduction Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, accounting and data protection. Apache Ranger already extends baseline features for coordinated enforcement across Hadoop workloads, from batch and interactive SQL to real-time. In this tutorial, […]
Overview Apache Ambari is a completely open operational framework for provisioning, managing and monitoring Apache Hadoop clusters. Ambari includes an intuitive collection of operator tools and a set of APIs that mask the complexity of Hadoop, simplifying the operation of clusters. In this tutorial, we will walk through some of the key aspects of […]
Introduction Hortonworks has recently announced the integration of Apache Atlas and Apache Ranger, and introduced the concept of tag or classification based policies. Enterprises can classify data in Apache Atlas and use the classification to build security policies in Apache Ranger. This tutorial walks through an example of tagging data in Atlas and building a […]
This tutorial will help you get started with Hadoop and HDP. We will use an Internet of Things (IoT) use case to build your first HDP application.
The new year brings new innovation and collaborative efforts. Various teams from the Apache community have been working hard for the last eighteen months to bring the EZ button to Apache Hadoop technology and Data Lake. In the coming months, we will publish a series of blogs introducing our Data Lake 3.0 architecture and highlighting […]
Hortonworks has achieved quite a bit of success with online dating. Personally, I haven’t just yet, but hey it warms my heart to think about all those that we’ve helped bring together. Valentine’s Day is upon us and so I wanted to launch this cupid’s arrow with a missive about how Hortonworks Data Platform (HDP) […]
This is the final post in a series of four posts on the implications of the Open Banking Standard (OBS) in the UK. The first post introduced the specification (http://hortonworks.com/blog/banking-innovation-uk-open-bank-project/). The second post (http://hortonworks.com/blog/business-implications-uk-open-bank-standard/) examined the business implications of the specification. The third examined the strategic drivers for incumbents to drive change in their platforms to achieve OBS […]
We are pleased to announce the latest release of Hortonworks Data Cloud for AWS. This release (version 1.11 for those that are keeping score) continues to drive towards the goal of making data processing easy and cost effective in the cloud. For those that aren’t familiar with Hortonworks Data Cloud for AWS (or “HDCloud” for […]
Today, Hortonworks announced the Hortonworks EDW Optimization Solution to help extend and accelerate return on investment for business intelligence e.g. the data warehouse. The solution brings together technologies from Hortonworks and partners Syncsort and AtScale. But before I dig into the details of this solution it is worth understanding the vision Hortonworks is revealing here. […]
You may have noticed our new homepage banner, “Be a Hero”, and thought to yourself, “I’ve been waiting for a radioactive spider bite all my life.” I have some good news for you, and no, it’s not the hairy spider living under your desk. There’s now an easier way to become a hero. At Hortonworks, […]
Apache Spark 2.1 was released recently in the community. The main focus of this release was improvements in Structured Streaming and Machine Learning. Structured Streaming: Kafka 0.10 support, Metrics & Stability improvements Machine Learning: SparkR Improvements including new ML algorithms for LDA, Random forests, GMM, etc. Wanna try Spark 2.1 now? Well, you are in […]
Last year Hadoop celebrated its tenth birthday, young in the land of data technologies. But the growth in popularity of Apache Hadoop is not slowing down anytime soon. In fact, results from the 2016 Big Data Maturity Survey indicate 97% of respondents plan to do more big data initiatives in the next 3 months. The […]
The NRF Big Show is here and it’s no surprise that retail data analytics are a hot topic. It’s an exciting time for retailers as we continue to discover the power of data to improve our ability to personalize the customer experience, drive brand loyalty and increase sales. Two key trends are emerging – retailers […]
We are very excited to be bringing you DataWorks Summit/Hadoop Summit this year. It’s the industry’s premier event focusing on next-generation big data solutions. We hope that you’ll be able to attend this year and learn from your peers and industry experts about how open source technologies like Apache Hadoop, Apache Spark, and Apache NiFi […]
For years, supply chain professionals in manufacturing industries have been aspiring to create a truly demand-driven supply chain. Actual progress, in reality, has been slowed by both the limited availability of real-time supply chain data and the inability to dynamically optimize actions based on this information. However, as the Big Data movement continues to revolutionize […]
As we kick off the new year I wanted to thank our customers, partners, Apache community members, and of course the amazing Hortonworks team, for an amazing 2016. Let’s take a step back and look at some of the Hortonworks highlights from last year… IN THE ECOSYSTEM there was tremendous acceleration. At the beginning of […]
This is the third in a series of four posts on the Open Banking Standard (OBS) in the UK. This third post will briefly look at the strategic drivers for banks while proposing an architectural style or approach for incumbents to drive change in their platforms to achieve OBS Compliance. The final post will discuss a […]
Bob Glithero Analytics Product Marketing Manager, Pivotal Over the last five years, mobile network operators (MNOs) realized 15% lower compound revenue growth on average than other types of communication service providers. With few exceptions, MNOs globally have seen a long-term decline in average revenue per user (ARPU). To reinvigorate growth, innovative MNOs are searching for […]
Apache Spark has ignited an explosion of data exploration on very large data sets. Spark played a big role in making general purpose distributed compute accessible. Anyone with some level of skill in Python, Scala, Java, and now R, can just sit down and start exploring data at scale. It also democratized Data Science by […]
The first post in this series (http://hortonworks.com/blog/banking-innovation-uk-open-bank-project/) discussed the emergence of the Open Bank Standard Working Group (OBWG) in the United Kingdom. The goal of this standard is to encourage the open and secure sharing of banking data among providers – via open APIs – thus providing more banking service choices for consumers. Open Banking Standard will spur […]
We are pleased to announce that Hortonworks DataFlow (HDF™) Version 2.1 is now generally available. You can download the latest version here! HDF 2.1 (powered by Apache NiFi, Apache Kafka and Apache Storm) brings enterprise readiness, platform stability and ease of use to the next level. Apache NiFi for dynamic, configurable data pipelines, through […]
“Banking as a service has long sat at the heart of our economy. In our digitally enabled world, the need to seamlessly and efficiently connect different economic agents who are buying and selling goods and services, is critical. The Open Banking Standard is a framework for making banking data work better: for customers; for businesses […]
Wow, I really can’t believe it has only been one year since we launched Hortonworks Community Connection — HCC. What started as a project to make communication between our technical teams more transparent has blossomed into a fantastic and engaging website. Here are just some of the interesting numbers: There are now over 40,000 assets […]
The first blog in this two part series (Deter Financial Crime by Creating an effective AML Program) described how Money Laundering (ML) activities employed by nefarious actors (e.g. drug cartels, corrupt public figures & terrorist organizations) have gotten more sophisticated over the years. Global and Regional Banks are falling short of their compliance goals despite huge […]
It is that time of year again, right before Christmas in Las Vegas, where nearly 30,000 technologists gather to see the latest in innovation around the Cloud. Hortonworks is honored to participate as an exhibitor for the first time. If you are in Vegas this week for the AWS re:Invent, please stop by our booth #2732 […]
The first post in this three part series on Digital Foundations @ http://www.vamsitalkstech.com/?p=2517 introduced the concept of Customer 360 or Single View of Customer (SVC). We discussed specific benefits from both a business & operational standpoint that are enabled by SVC. This second post in the series introduces the concept of a Customer Journey. The third & final […]
As discussed in the previous blog in this series @ http://hortonworks.com/blog/frtb-fundamental-review-trading-book-changes-banking-risk-management/, the FRTB (Fundamental Review of the Trading Book) compels Banks to create unified teams from various departments – especially Risk, Finance, the Front Office (where trading desks sit) and Technology to address all of the above significant challenges of the regulation. From a technology capabilities standpoint, the FRTB […]
Earlier this year, we started making Technical Previews of Hortonworks Data Cloud for AWS available. The feedback and response has been incredible, and over the past few months, we performed many Technical Preview refreshes. Now we are ready to make it official and release the product into AWS Marketplace. Therefore, we are excited to announce […]
Regulatory Risk Management evolves… The Basel Committee on Banking Supervision was put in place to ensure the stability of the financial system. The Basel Accords are the frameworks that essentially govern the risk-taking actions of a bank. To that end, minimum regulatory capital standards are introduced that banks must adhere to. The Bank for International Settlements […]
Apache, Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Metron and the Hadoop elephant and Apache project logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries.