Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.
The genesis of Hadoop came from the Google File System paper, published in October 2003. This paper spawned another research paper from Google – MapReduce: Simplified Data Processing on Large Clusters. Development started in the Apache Nutch project but was moved to the new Hadoop subproject in January 2006. The first committer added to the Hadoop project was Owen O’Malley in March 2006. Hadoop 0.1.0 was released in April 2006, and the platform continues to evolve through the work of the many contributors to the Apache Hadoop project. Hadoop was named after the toy elephant of co-founder Doug Cutting’s son.
In 2011, Rob Bearden partnered with Yahoo! to establish Hortonworks with 24 engineers from the original Hadoop team including founders Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O’Malley, Sanjay Radia, and Suresh Srinivas.
Organizations use Hadoop for its ability to store, manage, and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly, and at low cost.
The Hadoop Distributed File System (HDFS) provides scalable, fault-tolerant, cost-efficient storage for your big data lake. It was designed to span large clusters of commodity servers, scaling up to hundreds of petabytes and thousands of servers. By distributing storage across many servers, the combined storage resource can grow linearly with demand while remaining economical at every scale.
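To make the scaling arithmetic concrete, here is a small Python sketch — not HDFS code, just an illustration using the well-known HDFS defaults — of how a file is split into fixed-size blocks that are each replicated across the cluster:

```python
BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size (128 MB)
REPLICATION = 3                  # HDFS default replication factor

def block_count(file_bytes):
    # A file is stored as ceil(size / block size) blocks,
    # spread across the DataNodes in the cluster.
    return -(-file_bytes // BLOCK_SIZE)

def replica_count(file_bytes):
    # Each block is stored REPLICATION times on different nodes,
    # which is what makes the loss of any single server tolerable.
    return block_count(file_bytes) * REPLICATION

one_tb = 1024 ** 4
print(block_count(one_tb), replica_count(one_tb))
```

For a 1 TB file this works out to 8192 blocks and 24576 block replicas; adding servers adds capacity for more blocks, which is why the storage pool grows linearly with the cluster.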
MapReduce is the original framework for writing massively parallel applications that process large amounts of structured and unstructured data stored in HDFS. MapReduce can take advantage of the locality of data, processing it near the place it is stored on each node in the cluster in order to reduce the distance over which it must be transmitted.
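The pattern is easy to see in miniature. The sketch below reproduces the map, shuffle/sort, and reduce phases of the classic word count locally in Python; on a real cluster the mapper and reducer would run as distributed tasks (for example via Hadoop Streaming) on the nodes holding the data, and the function names here are illustrative.

```python
from itertools import groupby

def mapper(line):
    # Map phase: emit an intermediate (word, 1) pair per token.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum the counts for a single key.
    return (word, sum(counts))

def word_count(lines):
    # Shuffle/sort phase: bring pairs with the same key together,
    # mimicking what the framework does between map and reduce.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=lambda kv: kv[0])
    )

print(word_count(["the quick brown fox", "jumps over the lazy dog"]))
```

In Hadoop, only the map and reduce functions are user code; the framework handles partitioning the input, moving intermediate pairs between nodes, and rerunning failed tasks.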
More recently, Apache Hadoop YARN opened Hadoop to other data processing engines, such as Apache Spark, which can now run alongside existing MapReduce jobs to process data in many different ways at the same time. YARN provides the centralized resource management that enables you to process multiple workloads simultaneously. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce’s ability to scale to petabytes of data.
Applications can interact with the data in Hadoop using batch or interactive SQL (Apache Hive) or low-latency access with NoSQL (Apache HBase). Hive allows business users and data analysts to use their preferred business analytics, reporting and visualization tools with Hadoop. Data stored in HDFS in Hadoop can be searched using Apache Solr.
The Hadoop ecosystem extends data access and processing with powerful tools for data governance and integration including centralized security administration (Apache Ranger) and data classification tagging (Apache Atlas), which combined enable dynamic data access policies that proactively prevent data access violations from occurring. Hadoop perimeter security is also available to integrate with existing enterprise security systems and control user access to Hadoop (Apache Knox).
A very common request from many customers is the ability to index text in image files; for example, text in scanned PNG files. In this tutorial we are going to walk through how to do this with Solr. Prerequisites: download the Hortonworks Sandbox and complete the Learning the Ropes of the HDP Sandbox tutorial. Step-by-step guide […]
Overview The Azure cloud infrastructure has become a commonplace choice for users deploying virtual machines in the cloud due to its flexibility, ease of deployment, and cost benefits. In addition, Microsoft has expanded Azure to include a marketplace with thousands of certified, open source, and community software applications, developer services, and data, pre-configured for Microsoft Azure. […]
Introduction Hortonworks has recently announced the integration of Apache Atlas and Apache Ranger, and introduced the concept of tag- or classification-based policies. Enterprises can classify data in Apache Atlas and use the classification to build security policies in Apache Ranger. This tutorial walks through an example of tagging data in Atlas and building a […]
Introduction Hortonworks introduced Apache Atlas as part of the Data Governance Initiative, and has continued to deliver on its vision of an open source solution for a centralized metadata store, data classification, data lifecycle management, and centralized security. Atlas now offers, as a tech preview, cross-component lineage functionality, delivering a complete view of data movement […]
Apache Zeppelin on HDP 2.4.2, by Vinay Shukla. In March 2016 we delivered the second technical preview of Apache Zeppelin, on HDP 2.4. Since then, we and the Zeppelin community have continued to add new features to Zeppelin, and these features are now available in the final technical preview of Apache Zeppelin. This technical preview works with […]
Introduction In this tutorial, we will give you a taste of the powerful machine learning libraries in Apache Spark via a hands-on lab. We will also introduce the steps needed to get you up and running with Apache Zeppelin on a Hortonworks Data Platform (HDP) Sandbox. Prerequisites This tutorial is part of a series of […]
In this tutorial we will explore how you can use policies in HDP Advanced Security to protect your enterprise data lake and audit access by users to resources on HDFS, Hive and HBase from a centralized HDP Security Administration Console.
Introduction Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, accounting, and data protection. Apache Ranger already extends baseline features for coordinated enforcement across Hadoop workloads, from batch and interactive SQL to real-time in Hadoop. In this tutorial, […]
Hortonworks empowers organizations to maximize the outcome of their big data initiatives through improvements in security, governance, and operations. We are very pleased to announce that Hortonworks Data Platform (HDP) version 2.5 is now generally available for download. As part of an Open and Connected Data Platforms offering from Hortonworks, HDP 2.5 brings a […]
The neighborhood bank branch is slowly being phased out as the primary mode of customer interaction for banks. Banks across the globe have increased their technology investments in strategic areas such as analytics, data, and mobile. The bank of the future increasingly resembles a technology company. The Washington Post proclaimed in an […]
Baker Hughes CEO Martin Craighead says: “If a typical deep water well is like going to the moon, then the Gulf of Mexico ultra-deep water frontier is like going to Mars.”* Safely performing these kinds of complex and high risk operations requires many people to collaborate, share information and make informed decisions quickly. When […]
On October 1, 2016, the Hortonworks Certification Program is changing its structure: Our four current exams – HDPCD, HDPCD:Spark, HDPCD:Java and HDPCA – are being retired and replaced with a set of new exams. We are introducing three levels of certification: Associate: the new entry level into our certification program Professional: for experienced data professionals […]
“IT-driven business transformation is always bound to fail” – Amber Storey, Sr. Manager, Ernst & Young. The value of big data driven analytics is no longer in question, from both a customer and an enterprise standpoint. Lack of investment in an analytics strategy has the potential to impact shareholder value negatively. Business Boards […]
Hortonworks will exhibit at FIT 2016 (Financial Information Technology Exhibition), Japan’s largest financial IT fair, on Friday, September 9, 2016. About FIT 2016: Name: FIT 2016 (Financial Information Technology 2016); official page here. Organizer: The Japan Financial News Co., Ltd. (Nikkin). Venue: Tokyo International Forum (Yurakucho, Tokyo), Hall E, Hall D5, and the Glass Building: MAP. Dates: Thursday, September 8 – Friday, September 9, 2016 (two days). Exhibition hours: 10:00–18:00. Admission: free for staff of financial institutions (including securities, insurance, and non-bank firms) and their affiliated companies; all other visitors require an admission ticket. Attendees from general companies who need a ticket should contact Hortonworks (firstname.lastname@example.org). ▼ Registration ▼ https://fit.smartseminar.jp/public/application/add/228#seminar1020 Hortonworks Japan sessions: [Part 1] Big Data Strategy Challenges and Solutions in the Insurance Industry – Learning from the Latest Case Studies. Date and time: September 9, 2016, 10:00–11:00. Location: Glass Building 5F, GC Room (G-502). Overview: This session uses case studies to explain how leading insurers are putting big data to work. For example, usage-based insurance (UBI) is spreading worldwide; in addition to leading markets such as the US and UK, it is also growing in Japan. However, the increasing volume of data, together with the velocity and variety of the data required, can strain existing systems and processes. In claims adjustment, fraudulent claims must also be distinguished from valid ones, and combining unstructured data such as claim notes and social media analysis is said to improve analytical efficiency. The session introduces how leading companies are solving these problems. [Part 2] Big Data Strategy Challenges and Solutions in the Financial Industry – Learning from the Latest Case Studies. Time: September 9, 2016, 11:15–12:15. Location: Glass Building 5F, GC Room (G-502). Overview: […]
Ram Venkatesh also contributed to this blog series. Why Apache Hadoop in the Cloud? Ten years ago, Hadoop the elephant started the big data journey inside the firewall of a data center: the Apache Hadoop components were deployed on commodity servers inside a private data center. Now, the public cloud is another viable option for […]
Financial services is arguably one of the most complex industry sectors in terms of customers, services, and regulation. While there are obviously areas of overlap, the contrast between capital markets and private banking, retail banking and corporate banking and lending, or hedge funds and credit cards and payment networks is considerable. Last week, our general […]
Recent industry research by both Strategy Meets Action (SMA) and Novarica highlights analytics as the top priority for the insurance industry. Further, the Insurers’ 2016 Strategic Initiatives: Advancing Industry Transformation report by SMA identified customer engagement as another top priority for insurers. Success in the insurance industry depends on your company’s ability to quickly interact […]
This article is the second installment in a three-part series covering one of the most critical issues facing the financial industry: trade surveillance. While the previous post discussed the global scope of the problem across multiple jurisdictions, this post discusses a candidate big data and cloud computing architecture that can help market participants […]
It has been another exciting week on Hortonworks Community Connection (HCC). We continue to see great activity and recommend the following assets from last week. Top Articles from HCC: Implementing a real-time Hive Streaming example, by mjohnson. The Hive Streaming API enables near real-time data ingestion into Hive. This two-part posting reviews some of […]
“From coast to coast, the FBI and Securities and Exchange Commission have ensnared people not only at hedge funds, but at technology and pharmaceutical companies, consulting and law firms, government agencies, and even a major stock exchange.” – Preet Bharara, U.S. Attorney for the Southern District of New York, 2013; while announcing charges in a massive […]
User Interface and User Experience are some of the most important aspects of developing a product. No matter how many amazing features something has, a user must be able to access them in order to reap the full benefits of the product. For example, in the Apache Ambari Web UI, add-on apps called Views have, […]
Hello everyone. This is Kitase, in charge of marketing at Hortonworks. It has been a month since I joined Hortonworks, so I thought I would try writing a blog post. That said, this one is an introduction to Hadoop Summit 2016 Tokyo. Coverage of Hadoop Summit 2016 San Jose, held June 28–30 in San Jose, USA, is available here, and that excitement is now coming to Japan: on October 26–27, the global Apache Hadoop event Hadoop Summit will be held in Japan for the first time. The Hadoop Summit organizing committee is currently accepting speaker submissions; if you are interested, please apply. Reports on Hadoop Summit 2016 San Jose: “Hadoop at 10: the main battleground for data analytics shifts to cloud and data center integration” (a Hadoop Summit 2016 San Jose report); “Order and fragmentation in the big data ecosystem as seen at Hadoop Summit”; “The technologies and players are all in place! Kimihiko Kitase’s Hadoop Summit 2016 report.” Hadoop Summit 2016 Tokyo submission categories: Business – real-world case studies with business impact. Technical – talks by Apache committers; application development, analytics, and data science; governance, security, and operations; modern data applications, IoT, and streaming. How to apply: please apply via the link below. Talks may be given in Japanese, but submissions must be in English. Hadoop Summit 2016 Tokyo Call For Abstracts. Deadline: Friday, August 12. If you have any questions, please contact Melissa […]
It has been another exciting week on Hortonworks Community Connection (HCC). We continue to see great activity and recommend the following assets from last week. Top Articles from HCC: HDF installation on EC2, by mpandit. Hortonworks DataFlow (HDF), powered by Apache NiFi, Kafka, and Storm, collects, curates, analyzes, and delivers real-time data from the IoAT to […]
(Image courtesy of www.theastuteadvisor.com) “Perhaps more than anything else, failure to recognize the precariousness and fickleness of confidence – especially in cases in which large short-term debts need to be rolled over continuously – is the key factor that gives rise to the this-time-is-different syndrome. Highly indebted governments, banks, or corporations can seem to be merrily rolling along for an extended period, when bang! – confidence collapses, […]
It has been another exciting week on Hortonworks Community Connection (HCC). We continue to see great activity and recommend the following assets from last week. Top Articles from HCC: Phoenix HBase Tuning – Quick Hits, by smanjee. HBase tuning, like that of any other service within the ecosystem, requires an understanding of the configurations and the impact (good or […]
Following the success of our sold-out 2015 Roadshow, we are pleased to announce our worldwide Future of Data Roadshow 2016! The Roadshow brings the innovators driving the future of data to you and offers insightful content for both business and technical attendees. This is an invaluable opportunity to network with leaders who are transforming their business […]
It has been another exciting week on Hortonworks Community Connection (HCC). We continue to see great activity and recommend the following assets from last week. Top Articles from HCC: Horses for Courses: Apache Spark Streaming and Apache NiFi, by vvaks, comparing Apache NiFi and Apache Spark Streaming for different streaming and IoT use cases. Data Analysis […]
Hadoop Summit in San Jose wrapped up a few weeks ago. This was the ninth year and, wow, have we come a long way. It’s been a decade for Apache Hadoop and five years for Hortonworks. Hadoop Summit is the leading conference for Hadoop and data management, and this year saw well over 4,000 attendees […]
It has been another exciting week on Hortonworks Community Connection (HCC). We have lots of great technical content and are continuing to see great activity. We recommend the following assets from last week. Top Articles from HCC: Disaster Recovery and Backup Best Practices in a Typical Hadoop Cluster, Series 1 Introduction, by rbiswas. Disaster recovery plan […]
It has been another exciting week on Hortonworks Community Connection (HCC). We have lots of great technical content and are continuing to see great activity. We recommend the following assets from last week. Top Articles from HCC: Adding KDC Administrator Credentials to the Ambari Credential Store, by rlevas; Rack Awareness, by rbiswas; Spark+PyCharm+PyBuilder on Docker, by smanjee; YARN […]
“The data fabric is the next middleware.” – Todd Papaioannou, CTO at Splunk. Enterprises across the globe are confronting the need to create a digital strategy. While the term itself may seem intimidating to some, to the business it essentially implies an agile culture built on customer centricity and responsiveness. The only way to attain digital success […]
I was back ‘home’ for Hadoop Summit San Jose last week and I have to admit, it was fantastic to be hosting our customers and partners from across Europe, Middle East, Africa and Asia! It was a true testament to the relationships I’ve seen develop first hand within our international business over the past 12 […]
According to Strategy Meets Action (SMA), the value and disruption do not come from the “things” or the technology itself. New, actionable insights can be gleaned from massive amounts of new data being collected and analyzed. Insurers must build strong enterprise-wide data management and analytics capabilities to be in a position to capitalize on these […]
Apache, Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Metron and the Hadoop elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries.