Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.
The genesis of Hadoop came from the Google File System paper published in October 2003. That paper spawned another research paper from Google, "MapReduce: Simplified Data Processing on Large Clusters." Development started in the Apache Nutch project but moved to the new Hadoop subproject in January 2006. The first committer added to the Hadoop project was Owen O'Malley in March 2006. Hadoop 0.1.0 was released in April 2006, and the project continues to evolve through the work of the many contributors to Apache Hadoop. Hadoop was named after a toy elephant belonging to the son of one of its founders.
In 2011, Rob Bearden partnered with Yahoo! to establish Hortonworks with 24 engineers from the original Hadoop team including founders Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O’Malley, Sanjay Radia, and Suresh Srinivas.
Organizations use Hadoop for its ability to store, manage, and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly, and at low cost.
The Hadoop Distributed File System (HDFS) provides scalable, fault-tolerant, cost-efficient storage for your big data lake. It was designed to span large clusters of commodity servers, scaling up to hundreds of petabytes and thousands of servers. By distributing storage across many servers, the combined storage resource can grow linearly with demand while remaining economical at every scale.
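The scaling idea behind HDFS can be sketched in a few lines: files are split into fixed-size blocks, and each block is replicated on several nodes so that adding servers adds both capacity and fault tolerance. The following is a conceptual toy only, not HDFS's actual placement algorithm; the block size and replication factor shown are the Hadoop 2.x defaults.

```python
# Toy illustration of HDFS-style storage: a file is split into fixed-size
# blocks, and each block is replicated across several distinct nodes.
# This is a conceptual sketch, not how HDFS actually places replicas.

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size (128 MB) in Hadoop 2.x
REPLICATION = 3                 # default HDFS replication factor

def place_blocks(file_size, nodes):
    """Split a file of file_size bytes into blocks and assign each
    block's replicas to distinct nodes, round-robin style."""
    num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Pick REPLICATION distinct nodes for this block.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
plan = place_blocks(file_size=300 * 1024 * 1024, nodes=nodes)
print(len(plan))   # a 300 MB file needs 3 blocks of up to 128 MB
print(plan[0])     # each block's replicas live on 3 distinct nodes
```

Because every block exists on multiple nodes, losing any single server leaves every block still readable, which is what lets HDFS run reliably on commodity hardware.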
MapReduce is the original framework for writing massively parallel applications that process large amounts of structured and unstructured data stored in HDFS. MapReduce can take advantage of the locality of data, processing it near the place it is stored on each node in the cluster in order to reduce the distance over which it must be transmitted.
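The map/shuffle/reduce pattern described above can be illustrated with a single-process word count. Real Hadoop MapReduce jobs are typically written in Java and run distributed across the cluster, with the shuffle moving data between nodes; this pure-Python sketch only shows the programming model.

```python
# A pure-Python sketch of the MapReduce paradigm: map emits key/value
# pairs, a shuffle groups the values by key, and reduce aggregates each
# group. A real Hadoop job distributes these phases across the cluster.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one word.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In a cluster, each map task runs on the node holding its input block, which is the data locality the paragraph above describes: the computation moves to the data rather than the other way around.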
More recently, Apache Hadoop YARN has opened Hadoop to other data processing engines, such as Apache Spark, which can now run alongside existing MapReduce jobs to process data in many different ways at the same time. YARN provides the centralized resource management that enables you to process multiple workloads simultaneously. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez builds on the MapReduce paradigm, dramatically improving its speed while maintaining MapReduce's ability to scale to petabytes of data.
Applications can interact with the data in Hadoop using batch or interactive SQL (Apache Hive) or low-latency access with NoSQL (Apache HBase). Hive allows business users and data analysts to use their preferred business analytics, reporting and visualization tools with Hadoop. Data stored in HDFS in Hadoop can be searched using Apache Solr.
The Hadoop ecosystem extends data access and processing with powerful tools for data governance and integration including centralized security administration (Apache Ranger) and data classification tagging (Apache Atlas), which combined enable dynamic data access policies that proactively prevent data access violations from occurring. Hadoop perimeter security is also available to integrate with existing enterprise security systems and control user access to Hadoop (Apache Knox).
A very common request from many customers is the ability to index text in image files; for example, text in scanned PNG files. In this tutorial we walk through how to do this with Solr. Prerequisites: Download the Hortonworks Sandbox and complete the Learning the Ropes of the HDP Sandbox tutorial. Step-by-step guide […]
Overview The Azure cloud infrastructure has become a commonplace choice for deploying virtual machines due to its flexibility, ease of deployment, and cost benefits. In addition, Microsoft has expanded Azure to include a marketplace with thousands of certified, open source, and community software applications, developer services, and data, all pre-configured for Microsoft Azure. […]
In this tutorial we will explore how you can use policies in HDP Advanced Security to protect your enterprise data lake and audit access by users to resources on HDFS, Hive and HBase from a centralized HDP Security Administration Console.
Introduction Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, accounting, and data protection. Apache Ranger already extends baseline features for coordinated enforcement across Hadoop workloads, from batch and interactive SQL to real-time. In this tutorial, […]
In this tutorial we are going to explore how we can configure the YARN CapacityScheduler from Ambari. What is YARN's CapacityScheduler? YARN's CapacityScheduler is designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing the throughput and the utilization of the cluster. Traditionally each organization has its own private set of compute resources […]
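At its core, the CapacityScheduler is driven by a set of queue definitions in capacity-scheduler.xml (which Ambari edits on your behalf). A minimal sketch might look like the fragment below; the queue names and percentages here are purely illustrative examples, not defaults.

```xml
<!-- Hypothetical capacity-scheduler.xml fragment: two child queues under
     root, sharing the cluster 70/30. Queue names are examples only. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>engineering,analytics</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.engineering.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>30</value>
</property>
```

The capacities of the queues under a parent must sum to 100; each queue is guaranteed its share, and idle capacity can be borrowed by busier queues, which is how the scheduler keeps utilization high in a multi-tenant cluster.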
Overview Apache Ambari is a completely open operational framework for provisioning, managing and monitoring Apache Hadoop clusters. Ambari includes an intuitive collection of operator tools and a set of APIs that mask the complexity of Hadoop, simplifying the operation of clusters. In this tutorial, we will walk through some of the key aspects of […]
Introduction Hortonworks has recently announced the integration of Apache Atlas and Apache Ranger, and introduced the concept of tag or classification based policies. Enterprises can classify data in Apache Atlas and use the classification to build security policies in Apache Ranger. This tutorial walks through an example of tagging data in Atlas and building a […]
This tutorial will help you get started with Hadoop and HDP. We will use an Internet of Things (IoT) use case to build your first HDP application.
Cloud Computing is one of the big three trends impacting IT architectures today. What some may not realize is that an underlying connected data architecture is not only essential for cloud, but sits at the confluence of all three trends. Here’s why. The first big trend is IoT. According to BI Intelligence, we can now […]
The Hadoop community is gathering this week to hear from data scientists, innovators and thought leaders on the state of the data industry. A wide range of topics will be covered, ranging from Hadoop use cases to data visualization and user experience. Customers looking for comprehensive solutions to manage all of their data needs rely […]
In the US fast food industry, this is a common question when you order a burger: 'You want fries with that?' It's in the American psyche at this point, and has become common parlance. I recently heard this exchange: 'Hey, can I get a copy of your targeted promos report?' 'Sure! You want […]
How Hortonworks can help the hotel industry capture value through Insights Aggregation and Predictive Analytics Big Data has transformed every industry, including the hospitality vertical. Through customer analytics, targeted segmentation, and campaigns, hotels want to focus on delivering personalized promotions and cross-selling and up-selling travel services. Our objective is to address these challenges through an open-source […]
My life as part of a high performance team Last week we released Hortonworks DataFlow HDF 2.0. It was a great one-year anniversary present for me – a new release of the product I’ve been supporting since I joined Hortonworks a year ago. I’ve had the privilege of working with the most talented, quick-thinking, […]
Hortonworks DataFlow (HDF) 2.0 is now available! HDF is powered by Apache NiFi 1.0.0, which recently underwent a major redesign. Whether you’re a current user or just now planning to try it out, this is exciting news. A lot of new feature content went into this release such as multi-tenancy and zero-master clustering. The purpose […]
Last week I had a unique opportunity to present to a group of C-level retail industry leaders. Here are five stories I heard that you might find interesting. These are leaders in Merchandising, Marketing, Infrastructure and IT in top European companies. The common link was dinner and retail. I spoke briefly about my experience in retail and adoption of […]
Hadoop for Health Insurance: The future of the healthcare industry rests on the promise of collecting, analyzing, and acting on ever-larger amounts of information. Through the advancements in big data, machine learning, and advanced analytics, healthcare organizations can leverage data to improve overall member health, reduce costs, improve quality […]
This guest blog post is from our partner Attunity, a leading provider of big data management software solutions that enable access, management, sharing, and distribution of Big Data across heterogeneous enterprise platforms, organizations, and the cloud. Carole Gunst, Director of Marketing at Attunity, outlines our partnership and three opportunities to help customers learn […]
As enterprises around the world bring more of their sensitive data into Hadoop data lakes, balancing the need for democratization of access to data without sacrificing strong security principles becomes paramount. According to a recent research report by Securosis, “Hadoop has (mostly) reached security parity with the relational platforms of old, and that’s saying a […]
A lot has been said about Data Lakes over the past five years. The call to action from our industry to customers was to take all your data-at-rest in databases and warehouses, and add to it the data-in-motion from everything in your ecosystem. Then store all of the resulting terabytes and petabytes in a […]
You may have seen our invite to join the genomics consortium. Let me recap what this is about and bring you up to speed on our progress and next steps. Hortonworks is quarterbacking a consortium of leading healthcare organizations and subject matter experts to help develop the platform requirements for next generation […]
I just left a sold-out Melbourne Hadoop Summit 2016 in Australia. This was the first Summit in Asia Pacific, and I was excited by the tremendous response from the global and local community, and from regional organizations and businesses. The buzz was everywhere. We’re proud to be the host and the organizer. We couldn’t pull […]
This April, Hortonworks launched a multi-phase initiative to streamline Apache Hadoop operations, and the 1.3 release of SmartSense marks the delivery of the second phase of that initiative: Consolidated Cluster Activity Reporting. Hortonworks launched SmartSense in 2015 to help customers quickly collect cluster configuration, metrics, and logs to proactively detect […]
Given my role, and as I’ve outlined in my previous blog, the role of big data in marketing is a topic I’m particularly interested in. Recently, I hosted a webinar with Luca Olivari, chief data officer at Contactlab and it reignited my interest in how my fellow marketing peers are – or are not – […]
We are pleased to announce the latest release of Apache Ambari 2.4, which further simplifies Hadoop Operations. With Ambari 2.4 (which is part of the recently released Hortonworks Data Platform 2.5), enterprises can plan, install, and securely configure the Hortonworks Data Platform and easily provide ongoing maintenance and management. This new release includes an integrated […]
Hortonworks Empowers Organizations to Maximize the Outcome of their Big Data Initiatives through improvements in security, governance, and operations. We are very pleased to announce that Hortonworks Data Platform (HDP) Version 2.5 is now generally available for download. As part of the Open and Connected Data Platforms offering from Hortonworks, HDP 2.5 brings a variety of […]
The neighborhood bank branch is slowly being phased out as the primary mode of customer interaction for banks. Banks across the globe have increased their technology investments in strategic areas such as Analytics, Data & Mobile. The bank of the future increasingly resembles a technology company. The Washington Post proclaimed in an […]
Baker Hughes CEO Martin Craighead says: “If a typical deep water well is like going to the moon, then the Gulf of Mexico ultra-deep water frontier is like going to Mars.”* Safely performing these kinds of complex and high risk operations requires many people to collaborate, share information and make informed decisions quickly. When […]
“IT driven business transformation is always bound to fail” – Amber Storey, Sr Manager, Ernst & Young The value of Big Data driven analytics is no longer in question from both a customer and an enterprise standpoint. Lack of investment in an analytics strategy has the potential to impact shareholder value negatively. Business boards […]
Hortonworks will exhibit at FIT 2016 (Financial Information Technology 2016), Japan's largest financial IT fair, on Friday, September 9, 2016. About FIT 2016: Name: FIT2016 (Financial Information Technology 2016); official page: here. Organizer: Nikkin (The Japan Financial News Co.). Venue: Tokyo International Forum (Yurakucho, Tokyo), Hall E, Hall D5, and the Glass Building (MAP). Dates: Thursday, September 8 – Friday, September 9, 2016 (two days). Exhibition hours: 10:00–18:00. Admission is free for staff of financial institutions (including securities, insurance, and non-bank firms) and their affiliated companies; all other visitors require an admission ticket. Visitors from general companies who need a ticket should contact Hortonworks (email@example.com). ▼ Registration ▼ https://fit.smartseminar.jp/public/application/add/228#seminar1020 Hortonworks Japan sessions: [Session 1] Big Data Strategy Challenges and Solutions in the Insurance Industry: Learning from the Latest Case Studies. Date and time: September 9, 2016, 10:00–11:00. Location: Glass Building 5F, GC Room (Room G-502). Overview: This session uses case studies to show how leading insurers are putting big data to work. For example, usage-based insurance (UBI) is spreading worldwide, growing in Japan as well as in early markets such as the US and UK. However, the growing volume of data, and the velocity and variety of the data required, can strain existing systems and processes. In claims assessment, insurers must also distinguish fraud from valid claims, and combining unstructured data such as claim notes and social media analysis is said to improve analytical efficiency. We will show how leading companies are solving these problems. [Session 2] Big Data Strategy Challenges and Solutions in the Financial Industry: Learning from the Latest Case Studies. Time: September 9, 2016, 11:15–12:15. Location: Glass Building 5F, GC Room (Room G-502). Overview: […]
Ram Venkatesh also contributed to this blog series Why Apache Hadoop in the Cloud? Ten years ago, Hadoop the elephant began its Big Data journey inside the firewall of the data center: the Apache Hadoop components were deployed on commodity servers in private data centers. Now, the public cloud is another viable option for […]
Financial Services is arguably one of the most complex industry sectors in terms of customers, services, and regulation. While there are obviously areas of overlap, the contrast between capital markets and private banking, retail banking and corporate banking and lending, or hedge funds and credit cards and payment networks is considerable. Last week, our general […]
Recent industry research by both Strategy Meets Action (SMA) and Novarica highlights analytics as the top priority for the insurance industry. Further, the Insurers’ 2016 Strategic Initiatives: Advancing Industry Transformation report by SMA identified customer engagement as another top priority for insurers. Success in the insurance industry depends on your company’s ability to quickly interact […]
This article is the second installment in a three-part series that covers one of the most critical issues facing the financial industry: trade surveillance. While the first post discussed the global scope of the problem across multiple jurisdictions, this post discusses a candidate Big Data and Cloud Computing architecture that can help market participants […]
Apache, Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Metron and the Hadoop elephant and Apache project logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries.