Elasticsearch provides a real-time, distributed, open source search and analytics platform for structured and unstructured data.
The partnership between Hortonworks and Elasticsearch enables Hortonworks customers and prospects to add Elasticsearch real-time search and analytics on top of Hortonworks Data Platform (HDP). This allows HDP customers to complement their current investment in flexible batch processing with software that enables new use cases that are possible only with real-time interaction with the user.
Elasticsearch is a great fit for “Big Data” because its scalable, distributed nature allows it to search and store vast amounts of information in near real-time. Through the Elasticsearch-Hadoop integration, HDP users can enhance their workflow with a search and analytics engine, whether they work with native MapReduce, Hive, Pig or Cascading. Elasticsearch provides a rich query language for asking better questions and getting clearer answers, significantly faster.
Developers can write MapReduce jobs that index existing data in HDFS, making it searchable through the Elasticsearch REST API and related ecosystem. MapReduce jobs can also read their input datasets from Elasticsearch and write their output datasets back to it. This deep integration extends to Hive, Pig and Cascading.
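As a rough illustration of what indexing records through the REST API involves, the sketch below renders a handful of records in the newline-delimited JSON format accepted by the Elasticsearch `_bulk` endpoint. It is not how Elasticsearch-Hadoop is implemented. The index name `hdfs-logs`, the `id` field, and the sample records are assumptions made up for this example, and the step that sends the body to a cluster is left out.

```python
import json


def build_bulk_body(index_name, records, id_field="id"):
    """Render records as a newline-delimited JSON body for the _bulk endpoint.

    index_name, records and id_field are illustrative assumptions, not names
    taken from the Elasticsearch-Hadoop project.
    """
    lines = []
    for record in records:
        # Each document is preceded by an action line naming the target index.
        action = {"index": {"_index": index_name}}
        if id_field in record:
            action["index"]["_id"] = str(record[id_field])
        lines.append(json.dumps(action, sort_keys=True))
        lines.append(json.dumps(record, sort_keys=True))
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical records, standing in for rows read from a file in HDFS.
records = [
    {"id": 1, "message": "first log line"},
    {"id": 2, "message": "second log line"},
]
body = build_bulk_body("hdfs-logs", records)
print(body, end="")
```

In a real job the resulting body would be sent to the cluster in a POST request with the `application/x-ndjson` content type. With Elasticsearch-Hadoop the connector handles this on the job's behalf.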
The Elasticsearch-Hadoop project provides a dedicated InputFormat and OutputFormat for vanilla MapReduce, Taps for reading and writing data in Cascading, and Storages for Pig and Hive so you can access Elasticsearch just as if the data were in HDFS.
The integration enables cluster co-location by exposing shard information to Hadoop. Job tasks can then run on the same machines as the Elasticsearch shards they read from, reducing network traffic and improving performance through data locality.
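The idea behind data locality can be sketched as a small scheduling exercise: given which hosts hold each shard, prefer a worker on one of those hosts and fall back to any worker otherwise. This is a simplified toy model, not the scheduling logic used by Hadoop or Elasticsearch-Hadoop. The function name, the host names, and the least-loaded tie-breaking rule are all assumptions chosen for illustration.

```python
def assign_tasks(shard_hosts, worker_hosts):
    """Assign each shard to a worker, preferring a host that holds the shard.

    shard_hosts:  dict mapping shard id -> list of hosts holding a copy
    worker_hosts: list of hosts that can run tasks
    Returns a dict mapping shard id -> (chosen host, is_local).
    """
    workers = set(worker_hosts)
    load = {host: 0 for host in worker_hosts}
    assignment = {}
    for shard, hosts in sorted(shard_hosts.items()):
        # Hosts that both hold the shard and can run a task.
        local = [host for host in hosts if host in workers]
        candidates = local if local else list(worker_hosts)
        # Pick the least-loaded candidate; break ties by host name.
        chosen = min(candidates, key=lambda host: (load[host], host))
        load[chosen] += 1
        assignment[shard] = (chosen, bool(local))
    return assignment


# Hypothetical cluster layout: shard 2 lives on a host with no task slots.
shard_hosts = {0: ["node-a", "node-b"], 1: ["node-b"], 2: ["node-d"]}
worker_hosts = ["node-a", "node-b", "node-c"]
assignment = assign_tasks(shard_hosts, worker_hosts)
for shard, (host, is_local) in assignment.items():
    print(shard, host, "local" if is_local else "remote")
```

Only the task for shard 2 has to pull its data over the network in this layout. The other two tasks read from a shard on their own machine.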
HDP Certified - The HDP Certified badge indicates that this partner’s solution has been certified to work with HDP: it has been reviewed for architectural best practices, validated against a comprehensive suite of integration test cases, benchmarked for scale under varied workloads, and comprehensively documented.
YARN Ready - Apache Hadoop YARN is the data operating system for Hadoop 2. YARN Ready certification recognizes applications that integrate with YARN and process data via pushdown computation to the cluster. Examples of YARN Ready solutions include an application that has a native YARN application master or that leverages scale-out capabilities of the platform such as Hive, Spark and MR2.