Elasticsearch’s engine integrates with Hortonworks Data Platform 2.0 and YARN to provide real-time search and access to information in Hadoop.
See it in action: register for the Hortonworks and Elasticsearch webinar on March 5th, 2014 at 10 am PST / 1 pm EST to see the demo and an outline of best practices for integrating Elasticsearch and HDP 2.0 to extract maximum insight from your data.
Try it yourself: get started with this tutorial using Elasticsearch and the Hortonworks Data Platform (or the Hortonworks Sandbox) to explore server logs in Kibana, using Apache Flume for ingestion.
The following diagram depicts the proposed architecture for indexing logs into Elasticsearch in near real-time while also saving them to Hadoop for long-term batch analytics.
Elasticsearch is a search engine that can index new documents in near real-time and make them immediately available for querying. Elasticsearch is based on Apache Lucene and allows you to set up clusters of nodes that store any number of indices in a distributed, fault-tolerant way. If a node disappears, the cluster rebalances the (shards of) indices over the remaining nodes. You can configure how many shards make up each index and how many replicas each shard should have. If a primary shard goes offline, one of its replicas is promoted to primary and used to repopulate another node.
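For example, assuming an Elasticsearch node is reachable on the default HTTP port 9200, an index (named logs here purely for illustration) can be created with explicit shard and replica counts:
curl -XPUT 'http://localhost:9200/logs/' -d '{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'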
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into storage destinations such as the Hadoop Distributed File System. It has a simple and flexible architecture based on streaming data flows, and it is robust and fault-tolerant, with tunable reliability mechanisms for failover and recovery.
Kibana is an open source (Apache licensed), browser-based analytics and search interface to Logstash and other timestamped data sets stored in Elasticsearch. Kibana strives to be easy to get started with, while also being flexible and powerful.
Note: Define the JAVA_HOME environment variable and add the Java Virtual Machine and the Java binaries to your PATH environment variable.
Execute the following command to verify that Java is in the PATH:
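For example (this simply prints the installed Java version if it can be found on the PATH):
java -version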
Execute the following commands to install the Flume binaries and agent scripts:
yum install flume-agent flume
The latest Elasticsearch release can be downloaded from http://www.elasticsearch.org/download/
RPM downloads can be found at https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.noarch.rpm
To install Elasticsearch on data nodes:
rpm -ivh elasticsearch-0.90.7.noarch.rpm
Update the following properties in /etc/elasticsearch/elasticsearch.yml
index.number_of_replicas: 1
Note: If multicast is disabled, configure an initial list of master nodes in the cluster:
discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]
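Putting these together, the relevant portion of /etc/elasticsearch/elasticsearch.yml might look like the following; the cluster name, node name, and host list are placeholders:
cluster.name: es-hadoop-demo
node.name: es-node-1
index.number_of_shards: 5
index.number_of_replicas: 1
# disable multicast and list the master-eligible nodes explicitly
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1", "host2:9300"]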
Logging properties can be adjusted in /etc/elasticsearch/logging.yml. The default log location is /var/log/elasticsearch.
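Once the configuration is in place, the Elasticsearch service installed by the RPM can typically be started with its init script:
service elasticsearch start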
The Kibana binaries can be downloaded from https://download.elasticsearch.org/kibana/kibana/kibana-3.0.0milestone4.tar.gz
Extract the archive with:
tar -zxvf kibana-3.0.0milestone4.tar.gz
Edit the config.js file under the extracted directory and replace all occurrences of the elasticsearch parameter with the fully qualified hostname or IP of your Elasticsearch server. Then open index.html in your browser to access the Kibana UI.
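In Kibana 3 this amounts to a single line in config.js; the host below is a placeholder:
elasticsearch: "http://es-host.example.com:9200",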
For demonstration purposes, let's set up and configure a Flume agent on a host where a log file needs to be consumed, using the Flume configuration shown below.
Create the Flume plugins.d directory and copy the Elasticsearch dependencies into it:
cp $elasticsearch_home/lib/lucene-core-*.jar /usr/lib/flume/plugins.d
Update the Flume configuration to consume a local file and index it into Elasticsearch in Logstash format. Note: in real-world use cases, the Flume Log4j appender, syslog TCP source, Flume client SDK, or spooling directory source are preferred over tailing logs.
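A minimal sketch of such a configuration, assuming the ElasticSearchSink that ships with Flume 1.4 and using placeholder agent, host, and channel names, might look like this:
# demo agent that tails a local file and indexes events into Elasticsearch
agent1.sources = tail_src
agent1.channels = mem_ch
agent1.sinks = es_sink

# exec source tailing the demo log file (for demonstration only; see the note above)
agent1.sources.tail_src.type = exec
agent1.sources.tail_src.command = tail -F /tmp/es_log.log
agent1.sources.tail_src.channels = mem_ch

# in-memory channel
agent1.channels.mem_ch.type = memory
agent1.channels.mem_ch.capacity = 10000

# Elasticsearch sink writing Logstash-style daily indices (logstash-yyyy-MM-dd)
agent1.sinks.es_sink.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent1.sinks.es_sink.channel = mem_ch
agent1.sinks.es_sink.hostNames = es-host.example.com:9300
agent1.sinks.es_sink.indexName = logstash
agent1.sinks.es_sink.clusterName = elasticsearch
agent1.sinks.es_sink.batchSize = 100
agent1.sinks.es_sink.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer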
Create a file /tmp/es_log.log with the following data:
Open $KIBANA_HOME/index.html in a browser. By default, the welcome page is shown.
Click on “Logstash Dashboard” and select an appropriate time range to look at the charts based on the timestamped fields.
These screenshots show the various chart types available for search fields, e.g. pie, bar, and table charts.
Content can be searched with custom filters and graphs can be plotted based on the search results as shown below.
The elasticsearch-hadoop library provides:
- MapReduce input and output formats
- A Hive SerDe implementation
- A Pig storage handler
Detailed documentation and examples for Elasticsearch-Hadoop integration can be found at https://github.com/elasticsearch/elasticsearch-hadoop
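As a rough sketch of what the Hive side of that integration looks like, an external table can be mapped onto an Elasticsearch index. The table, index, and host names below are placeholders, and the storage handler class name should be checked against the elasticsearch-hadoop release in use:
-- hypothetical Hive table backed by an Elasticsearch index
CREATE EXTERNAL TABLE es_logs (
  message STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'logstash-2014.03.05/logs',
  'es.nodes'    = 'es-host.example.com:9200'
);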
discovery.zen.minimum_master_nodes should be set to something like N/2 + 1, where N is the number of master-eligible nodes.
Set action.disable_delete_all_indices to disable accidental deletion of all indices.
Set gateway.recover_after_nodes to the number of nodes that need to be up before the recovery process starts replicating data around the cluster.
Set index.fielddata.cache: soft to avoid OutOfMemory errors.
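On a hypothetical three-node cluster, these recommendations translate into elasticsearch.yml entries such as:
# with 3 master-eligible nodes, a quorum is 3/2 + 1 = 2
discovery.zen.minimum_master_nodes: 2
# reject requests that would delete all indices at once
action.disable_delete_all_indices: true
# wait for at least 2 nodes before starting recovery
gateway.recover_after_nodes: 2
# use soft references for the field data cache to avoid OutOfMemory errors
index.fielddata.cache: soft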