Indexing and Searching Documents with Apache Solr

In this tutorial, we will learn to:

  • Configure Solr to store indexes in HDFS
  • Create a Solr cluster of two Solr instances running on ports 8983 and 8984
  • Index documents in HDFS using the Hadoop connectors
  • Use Solr to search documents


Step 1 – Log into Sandbox

  • After the VM boots, find its IP address and add an entry for it to your machine's hosts file, e.g. sandbox
  • Connect to the VM via SSH (root/hadoop) and correct the /etc/hosts entry
  • If running on an Ambari-installed HDP 2.3 cluster (instead of the sandbox), run the following to install HDPsearch:
yum install -y lucidworks-hdpsearch
sudo -u hdfs hadoop fs -mkdir /user/solr
sudo -u hdfs hadoop fs -chown solr /user/solr
  • If running on the HDP 2.3 sandbox, run the following instead:
chown -R solr:solr /opt/lucidworks-hdpsearch
  • Run the remaining steps as the solr user:
su solr

Step 2 – Configure Solr to store index files in HDFS

  • For this lab, we will use the schemaless configuration that ships with Solr
    • Schemaless configuration is a set of Solr features that let you index documents without specifying their schema in advance
    • A sample schemaless configuration can be found in the directory /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs
  • Let’s create a copy of the sample schemaless configuration and modify it to store indexes in HDFS:
    cp -R /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs  /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs 
  • Open /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf/solrconfig.xml in your favorite editor and make the following changes:

1- Replace the existing section beginning with:

                <directoryFactory name="DirectoryFactory"

    with the following (set solr.hdfs.home to the full HDFS URI of the /user/solr directory created in step 1):

            <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
                <str name="solr.hdfs.home">hdfs://</str>
                <bool name="solr.hdfs.blockcache.enabled">true</bool>
                <int name="solr.hdfs.blockcache.slab.count">1</int>
                <bool name="solr.hdfs.blockcache.direct.memory.allocation">false</bool>
                <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
                <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
                <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
                <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
                <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
                <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
            </directoryFactory>
2- Change the lockType from native to hdfs:

                <lockType>${solr.lock.type:hdfs}</lockType>
3- Save and exit the file
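The lockType change in edit 2 can also be scripted instead of made by hand. A minimal sketch using sed, assuming the copied configset still contains the stock native lockType line:

```shell
# Path of the configset copied earlier in this step; adjust if yours differs.
CONF=/opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf

# Swap the stock native lockType for hdfs (edit 2 above).
sed -i 's#${solr.lock.type:native}#${solr.lock.type:hdfs}#' "$CONF/solrconfig.xml"

# Confirm the change took effect.
grep '<lockType>' "$CONF/solrconfig.xml"
```

The grep at the end should print the single lockType line now referencing hdfs; if it still shows native, the stock line did not match and the file needs to be edited manually.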

Step 3 – Start 2 Solr instances in solrcloud mode

mkdir -p ~/solr-cores/core1
mkdir -p ~/solr-cores/core2
cp /opt/lucidworks-hdpsearch/solr/server/solr/solr.xml ~/solr-cores/core1
cp /opt/lucidworks-hdpsearch/solr/server/solr/solr.xml ~/solr-cores/core2
/opt/lucidworks-hdpsearch/solr/bin/solr start -cloud -p 8983 -z localhost:2181 -s ~/solr-cores/core1
/opt/lucidworks-hdpsearch/solr/bin/solr start -cloud -p 8984 -z localhost:2181 -s ~/solr-cores/core2

Step 4 – Create a Solr Collection named “labs” with 2 shards and a replication factor of 2

/opt/lucidworks-hdpsearch/solr/bin/solr create -c labs \
-d /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf -n labs -s 2 -rf 2

Step 5 – Validate that the labs collection got created

  • List all collections with the Collections API and confirm that labs appears in the response
curl "http://localhost:8983/solr/admin/collections?action=LIST&wt=json"
Step 6 – Load documents to HDFS

  • Upload the sample CSV file to HDFS. We will index this file with Solr using the Solr Hadoop connectors
hadoop fs -mkdir -p csv
hadoop fs -put /opt/lucidworks-hdpsearch/solr/example/exampledocs/books.csv csv/

Step 7 – Index documents with Solr using Solr Hadoop Connector

hadoop jar /opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar \
  com.lucidworks.hadoop.ingest.IngestJob \
  -DcsvFieldMapping=0=id,1=cat,2=name,3=price,4=instock,5=author \
  -DcsvFirstLineComment -DidField=id -DcsvDelimiter="," \
  -Dlww.commit.on.close=true \
  -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
  -c labs -i csv/* -of -zk localhost:2181
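The csvFieldMapping argument pairs CSV columns, by position, with Solr field names. A small illustration of that pairing (not the connector's actual code; the sample row follows the books.csv layout):

```shell
# Illustration only: how the positional mapping 0=id,1=cat,... labels
# the columns of one CSV record before it is sent to Solr.
echo '0553573403,book,A Game of Thrones,7.99,true,George R.R. Martin' |
  awk -F, '{ print "id="$1, "cat="$2, "name="$3, "price="$4, "instock="$5, "author="$6 }'
# -> id=0553573403 cat=book name=A Game of Thrones price=7.99 instock=true author=George R.R. Martin
```

Because the mapping is positional, the column order in your CSV must match the mapping exactly, and -DcsvFirstLineComment tells the job to skip the header row.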

Step 8 – Search indexed documents

  • Query the labs collection for all indexed documents (wt=json returns the results as JSON)
curl "http://localhost:8983/solr/labs/select?q=*:*&wt=json&indent=true"

In this tutorial we have explored how to:

  • Store Solr indexes in HDFS
  • Create a Solr Cluster
  • Index documents in HDFS using Solr Hadoop connectors


October 14, 2015 at 10:54 pm


I am getting this error:

Diagnostics: Container [pid=120452,containerID=container_e02_1444801490747_0004_02_000001] is running beyond physical memory limits. Current usage: 185.0 MB of 170 MB physical memory used; 1.9 GB of 357.0 MB virtual memory used. Killing container.

Can you help with this?


November 21, 2015 at 4:12 am

I am getting an error at Step 4

[solr@sclnxlap root]$ /opt/lucidworks-hdpsearch/solr/bin/solr create -c labs -d /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf -n labs -s 2 -rf 2
ERROR: create failed due to: Expected JSON response from server but received:

Error 404 Not Found

Problem accessing /solr/admin/info/system. Reason:

    Not Found

Powered by Jetty://

Typically, this indicates a problem with the Solr server; check the Solr server logs for more information.

    February 6, 2016 at 5:46 am

    I had the same problem and found from the logs that ZooKeeper had stopped.
    I restarted the Sandbox with 16 GB of memory and everything worked.

December 17, 2015 at 8:58 am

Can we also use Lucene to build an index directly in HDFS? If so, how?

Please suggest tutorials for the same.

