February 28, 2014

How To Configure Elasticsearch on Hadoop with HDP

Elasticsearch’s engine integrates with Hortonworks Data Platform 2.0 and YARN to provide real-time search and access to information in Hadoop.

See it in action: register for the Hortonworks and Elasticsearch webinar on March 5, 2014 at 10am PST / 1pm EST to see the demo and an outline of best practices for integrating Elasticsearch and HDP 2.0 to extract maximum insight from your data.

Try it yourself: Get started with this tutorial using Elasticsearch and Hortonworks Data Platform, or Hortonworks Sandbox to access server logs in Kibana using Apache Flume for ingestion.


The following diagram depicts the proposed architecture for indexing logs into Elasticsearch in near real-time while also saving them to Hadoop for long-term batch analytics.




Elasticsearch is a search engine that can index new documents in near real-time and make them immediately available for querying. Elasticsearch is based on Apache Lucene and allows for setting up clusters of nodes that store any number of indices in a distributed, fault-tolerant way. If a node disappears, the cluster will rebalance the (shards of) indices over the remaining nodes. You can configure how many shards make up each index and how many replicas of these shards there should be. If a master shard goes offline, one of the replicas is promoted to master and used to repopulate another node.
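Shard and replica counts can also be set per index at creation time. A sketch of the REST call, assuming a node listening on localhost:9200 and using the tutorial's logstash-style index name:

```
curl -XPUT 'http://localhost:9200/logstash-2014-02-28' -d '{
  "settings": { "number_of_shards": 5, "number_of_replicas": 1 }
}'
```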


Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into different storage destinations such as the Hadoop Distributed File System (HDFS). It has a simple, flexible architecture based on streaming data flows, and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.


Kibana is an open source (Apache-licensed), browser-based analytics and search interface for Logstash and other timestamped data sets stored in Elasticsearch. Kibana strives to be easy to get started with, while also being flexible and powerful.

System Requirements

  • Hadoop: Hortonworks Data Platform 2.0 (HDP 2.0) or the Hortonworks Sandbox for HDP 2.0
  • OS: 64-bit RHEL (Red Hat Enterprise Linux) 6, CentOS 6, or Oracle Linux 6
  • Software: yum, rpm, unzip, tar, wget, java
  • JDK: Oracle JDK 1.7 (64-bit), Oracle JDK 1.6 update 31, or OpenJDK 7

Java Installation

Note: Define the JAVA_HOME environment variable and add the Java Virtual Machine and the Java binaries to your PATH environment variable.

Execute the following commands to verify that Java is on the PATH:

export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH
java -version

Flume Installation

Execute the following command to install the Flume binaries and agent scripts:
yum install flume-agent flume

Elasticsearch Installation

The latest Elasticsearch release can be downloaded from the following URL

RPM downloads can be found at:

To install Elasticsearch on data nodes:

rpm -ivh elasticsearch-0.90.7.noarch.rpm

Setup and configure Elasticsearch

Update the following properties in  /etc/elasticsearch/elasticsearch.yml

  • Set the cluster name: cluster.name: logsearch
  • Set the node name: node.name: "node1"
  • By default every node is eligible to be master and stores data. This can be adjusted with:
    • node.master: true
    • node.data: true
  • The number of shards can be adjusted with index.number_of_shards: 5
  • The number of replicas (additional copies) can be set with index.number_of_replicas: 1
  • Adjust the data path with path.data: /data1,/data2,/data3,/data4
  • Set discovery.zen.minimum_master_nodes: 1 to ensure a node sees that many other master-eligible nodes before it considers the cluster operational. This property needs to be set based on the size of the cluster.
  • Set discovery.zen.ping.timeout: 3s, the time to wait for ping responses from other nodes during discovery. The value needs to be higher on slow or congested networks.
  • Set discovery.zen.ping.multicast.enabled: false only if multicast is not supported in the network.

Note: If multicast is disabled, configure an initial list of master nodes in the cluster with discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]
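Taken together, the settings above correspond to an elasticsearch.yml along these lines (host names and data paths are illustrative and must be adjusted for your cluster):

```
# /etc/elasticsearch/elasticsearch.yml -- sketch of the settings described above
cluster.name: logsearch
node.name: "node1"
node.master: true
node.data: true
index.number_of_shards: 5
index.number_of_replicas: 1
path.data: /data1,/data2,/data3,/data4
discovery.zen.minimum_master_nodes: 1
discovery.zen.ping.timeout: 3s
# only if multicast is not supported on the network:
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1", "host2:9300"]
```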

Logging properties can be adjusted in /etc/elasticsearch/logging.yml. The default log location is: /var/log/elasticsearch

Starting and Stopping Elasticsearch

  • To start Elasticsearch: /etc/init.d/elasticsearch start
  • To stop Elasticsearch: /etc/init.d/elasticsearch stop

Kibana Installation

Download the Kibana binaries from the following URL


Extract the archive with tar -zxvf kibana-3.0.0milestone4.tar.gz

Setup and configure Kibana

  • Open the config.js file under the extracted directory
  • Set the elasticsearch parameter to the fully qualified hostname or IP of your Elasticsearch server:
  • elasticsearch: "http://<YourIP>:9200"
  • Open index.html in your browser to access the Kibana UI
  • Update the Logstash index pattern to the Flume-supported index pattern:
  • Edit app/dashboards/logstash.json and replace all occurrences of [logstash-]YYYY.MM.DD with [logstash-]YYYY-MM-DD
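The index-pattern replacement can be scripted with sed. The snippet below demonstrates the substitution on a throwaway temp file containing a sample line; in a real Kibana tree you would point sed at app/dashboards/logstash.json instead:

```shell
# Demonstrate the index-pattern substitution on a throwaway copy;
# replace "$f" with app/dashboards/logstash.json in a real install.
f=$(mktemp)
printf '%s\n' '"index": "[logstash-]YYYY.MM.DD",' > "$f"   # sample line from logstash.json
sed -i 's/\[logstash-\]YYYY\.MM\.DD/[logstash-]YYYY-MM-DD/g' "$f"
cat "$f"
```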

Setup and configure Flume

For demonstration purposes, let's set up and configure a Flume agent on the host where the log file to be consumed resides, using the following Flume configuration.

Create plugins.d directory and copy the Elasticsearch dependencies:

mkdir /usr/lib/flume/plugins.d
cp $elasticsearch_home/lib/elasticsearch-0.90*jar /usr/lib/flume/plugins.d
cp $elasticsearch_home/lib/lucene-core-*jar /usr/lib/flume/plugins.d

Update the Flume configuration to consume a local file and index the events into Elasticsearch in Logstash format. Note: in real-world use cases, the Flume Log4j appender, the syslog TCP source, the Flume client SDK, or the spooling directory source are preferred over tailing logs.

agent.sources = tail
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.sources.tail.channels = memoryChannel
agent.sources.tail.type = exec
agent.sources.tail.command = tail -F /tmp/es_log.log
agent.sources.tail.interceptors = i1 i2 i3
agent.sources.tail.interceptors.i1.type = regex_extractor
agent.sources.tail.interceptors.i1.regex = (\w.*):(\w.*):(\w.*)\s
agent.sources.tail.interceptors.i1.serializers = s1 s2 s3
agent.sources.tail.interceptors.i1.serializers.s1.name = source
agent.sources.tail.interceptors.i1.serializers.s2.name = type
agent.sources.tail.interceptors.i1.serializers.s3.name = src_path
agent.sources.tail.interceptors.i2.type = timestamp
agent.sources.tail.interceptors.i3.type = host
agent.sources.tail.interceptors.i3.hostHeader = host
agent.sinks = elasticsearch
agent.sinks.elasticsearch.channel = memoryChannel
agent.sinks.elasticsearch.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
# hostNames takes a comma-separated list of Elasticsearch host:port pairs (transport port 9300 by default)
agent.sinks.elasticsearch.hostNames =
agent.sinks.elasticsearch.indexName = logstash
agent.sinks.elasticsearch.clusterName = logsearch
agent.sinks.elasticsearch.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer

Prepare sample data for a simple test

Create a file /tmp/es_log.log with the following data

website:weblog:login_page weblog data1
website:weblog:profile_page weblog data2
website:weblog:transaction_page weblog data3
website:weblog:docs_page weblog data4
syslog:syslog:sysloggroup syslog data1
syslog:syslog:sysloggroup syslog data2
syslog:syslog:sysloggroup syslog data3
syslog:syslog:sysloggroup syslog data4
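Before restarting Flume, you can sanity-check that every sample line matches the interceptor's regular expression; here grep -E stands in for Flume's regex extractor, using the same file path as the configuration above:

```shell
# Write the sample data and count the lines matching the interceptor
# regex (\w.*):(\w.*):(\w.*)\s -- all 8 lines should match.
cat > /tmp/es_log.log <<'EOF'
website:weblog:login_page weblog data1
website:weblog:profile_page weblog data2
website:weblog:transaction_page weblog data3
website:weblog:docs_page weblog data4
syslog:syslog:sysloggroup syslog data1
syslog:syslog:sysloggroup syslog data2
syslog:syslog:sysloggroup syslog data3
syslog:syslog:sysloggroup syslog data4
EOF
grep -cE '(\w.*):(\w.*):(\w.*)\s' /tmp/es_log.log
```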

Restart Flume

/etc/init.d/flume-agent restart

Searching and Dashboarding with Kibana

Open $KIBANA_HOME/index.html in a browser. By default, the welcome page is shown.

Click on “Logstash Dashboard” and select an appropriate time range to view charts based on the timestamped fields.


These screenshots show the various charts available on search fields, e.g. pie, bar, and table charts.

Content can be searched with custom filters and graphs can be plotted based on the search results as shown below.


Batch Indexing using MapReduce/Hive/Pig

Elasticsearch’s real-time search and analytics are natively integrated with Hadoop and support MapReduce, Cascading, Hive, and Pig.

Component   Implementation                                    Notes
MR2/YARN    ESInputFormat, ESOutputFormat                     MapReduce input and output formats provided by the library
Hive        org.elasticsearch.hadoop.hive.ESStorageHandler    Hive SerDe implementation
Pig         org.elasticsearch.hadoop.pig.ESStorage            Pig storage handler
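As a sketch of the Hive integration, an external table backed by the logstash index might be declared as follows. The table and field names are hypothetical, the es.resource value must match your index, and exact property names vary across elasticsearch-hadoop releases:

```sql
-- Hypothetical sketch: expose an Elasticsearch index to Hive.
-- Requires the elasticsearch-hadoop jar on the Hive classpath.
CREATE EXTERNAL TABLE logs (
  source   STRING,
  type     STRING,
  src_path STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.ESStorageHandler'
TBLPROPERTIES ('es.resource' = 'logstash/logs');
```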

Detailed documentation with examples of Elasticsearch-Hadoop integration can be found at the following URL

Thoughts on Best Practices

  1. Always set discovery.zen.minimum_master_nodes to N/2 + 1, where N is the number of master-eligible nodes, to avoid split brain.
  2. Set action.disable_delete_all_indices to prevent accidental deletion of all indices.
  3. Set gateway.recover_after_nodes to the number of nodes that need to be up before the recovery process starts replicating data around the cluster.
  4. Relax the real-time aspect from 1 second to something a bit higher (index.engine.robin.refresh_interval).
  5. Increase the memory allocated to the Elasticsearch node; by default it is 1 GB.
  6. Use Java 7 if possible for better performance with Elasticsearch.
  7. Set index.fielddata.cache: soft to avoid OutOfMemory errors.
  8. Use larger batch sizes in the Flume sink for higher throughput, e.g. 1000.
  9. Increase the open file limits for Elasticsearch.
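For the last point, open-file limits are typically raised in /etc/security/limits.conf. The entries below are an example; the user name and value are assumptions to adjust for your installation:

```
# /etc/security/limits.conf -- example entries raising the open-file
# limit for the user running Elasticsearch (values are illustrative)
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536
```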


William Fox says:

This tutorial is straightforward; it does not look like it takes a lot of time to get it up and running. I like the detail in Kibana, as it would be very useful in my environment. I am looking to get Elasticsearch, Flume, and Kibana set up in my lab. I will post my feedback. Thanks

Hari Sekhon says:

Doesn’t this cause the data to be stored twice, once in HDFS and once in ElasticSearch?

SolrCloud's native HDFS integration, as seen in Cloudera Search, seems to have an advantage in that regard.

Bảo Trần says:

Regarding the step “Setup and configure Flume”:
I cannot find the two *.jar files; is that the exact location ($elasticsearch_home/lib/elasticsearch)?

Ram says:

I successfully started Elasticsearch but I was not able to open the index.html file. Could someone please tell me how to open index.html from the sandbox? I am using the Azure HDP Sandbox.

Ram says:

I tried http::9200/index.html, but it didn’t work.

Raghavendra says:

I only see setup details for extracting the Kibana files, but no details on how to start Kibana.
Elasticsearch is fine.
