Searching Data with Apache Solr


In this tutorial we will walk through how to use Apache Solr with Hadoop to index and search data stored on HDFS. It’s not meant as a general introduction to Solr.

After working through this tutorial you will have Solr running on your Hortonworks Sandbox, along with a solrconfig and a schema that you can easily adapt to your own use cases. You will also learn how to use Hadoop MapReduce to index files.


Ingredients:

1. Hortonworks HDP Sandbox 2.1

2. Apache Solr 4.7.2

3. Lucidworks Job Jar

Remarks: I was using VMware Fusion to run the Sandbox. If you choose VirtualBox things should look the same, except that your VM will not have its own IP address; instead Solr will be listening on a port forwarded to localhost. For convenience I added sandbox as a host to my /etc/hosts file on my Mac. Apache Solr 4.7.2 is the version officially supported by Hortonworks as I'm writing this (May 2014).
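For reference, the /etc/hosts entry on the host machine looks like the line below. The IP address is a placeholder; use the address your sandbox VM actually reports (for example via ifconfig eth0 inside the VM):

```
# /etc/hosts on your Mac/Windows host -- replace the IP with your VM's address
192.168.x.x  sandbox
```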


Let’s get started: power up the sandbox with at least 4 GB of main memory.

    ssh root@sandbox (password: hadoop)

Open your browser and verify that all services are running. We will only need HDFS and MapReduce but “all lights green” is always good ;-) 

We start by creating a solr user and a folder where we are going to install the binaries:

    adduser solr
    passwd solr

    mkdir /opt/solr
    chown solr /opt/solr

Now copy the binaries you downloaded (see the list of ingredients above) from your Mac / Windows host to the Sandbox:

    cd ~/Downloads
    scp solr-4.7.2.tar lucidworks-hadoop-1.2.0-0-0.tar solr@sandbox:/opt/solr

The next step is creating dummy data that we will later index in Solr and make searchable. As mentioned above this is “Hello World!”, so better not expect big data. The file we are going to index is a four-line CSV file. Type the following on your Sandbox command prompt:

    echo id,text >/tmp/mydata.csv
    echo 1,Hello >>/tmp/mydata.csv
    echo 2,HDP  >>/tmp/mydata.csv
    echo 3,and  >>/tmp/mydata.csv
    echo 4,Solr >>/tmp/mydata.csv
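The same file can be created in one go with a heredoc; both approaches produce an identical /tmp/mydata.csv:

```shell
# Equivalent to the echo commands above: a header line plus four data rows
cat > /tmp/mydata.csv <<'EOF'
id,text
1,Hello
2,HDP
3,and
4,Solr
EOF
# Sanity check: five lines including the header
wc -l < /tmp/mydata.csv
```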

Then we need to prepare HDFS:

    su - hdfs
    hadoop fs -mkdir -p /user/solr/data/csv 
    hadoop fs -chown solr /user/solr
    hadoop fs -put /tmp/mydata.csv /user/solr/data/csv

Now it’s getting more interesting as we are about to install Solr:

    su - solr
    cd /opt/solr

    tar xvf solr-4.7.2.tar
    # Untarring is all we need to install Solr! We still need to integrate it into HDP though.
    tar xvf lucidworks-hadoop-1.2.0-0-0.tar
    ln -s solr-4.7.2 solr
    ln -s lucidworks-hadoop-1.2.0-0-0 jobjar

Solr comes with a nice example which we will use as a starting point:

    cd solr
    cp -r example hdp
    # Remove unnecessary files:
    rm -fr hdp/example* hdp/multicore
    # Our core (basically the index) will be called hdp1 instead of collection1:
    mv hdp/solr/collection1 hdp/solr/hdp1
    # Remove the existing core
    rm hdp/solr/hdp1/

Now comes the most difficult part: making Solr store its data on HDFS and creating a schema for our “Hello World” CSV file. We need to modify two files: solrconfig.xml and schema.xml.

    vi hdp/solr/hdp1/conf/solrconfig.xml

Search for the directoryFactory tag and completely replace it with the following (make sure you copy the full lines; they may appear truncated in the browser, but when you copy/paste you get the complete lines):

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://sandbox:8020/user/solr</str>
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
      <int name="solr.hdfs.blockcache.slab.count">1</int>
      <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
      <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
      <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
      <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
      <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
      <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
      <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
    </directoryFactory>

Now, still in solrconfig.xml, look for lockType and change it to hdfs:

    <lockType>hdfs</lockType>
Save the file and open schema.xml

    vi hdp/solr/hdp1/conf/schema.xml

In the <fields> tag, remove the existing <field> entries except Solr’s internal ones (at minimum keep the _version_ field, which the update log requires).

Leave the dynamic fields unchanged (they could be useful for your own use cases, but we will not need them in this example).
Add the following fields:

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="text" multiValued="true" stored="true"  type="text_en" indexed="true"/>
    <field name="data_source" stored="false" type="text_en" indexed="true"/> 

The data_source field is required by the MapReduce-based indexing we will use later. The fields named id and text match the two columns in our CSV file.
Next remove all copyField tags and add the following, which copies the id into text so we can search both:

    <copyField dest="text" source="id"/>
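Both config files must stay well-formed XML or Solr will refuse to start the core. If xmllint (part of libxml2, present on the sandbox and most Linux systems) is available, a quick check catches typos before you restart. Demonstrated here on a throwaway fragment; on the sandbox you would point it at the real hdp/solr/hdp1/conf/solrconfig.xml and schema.xml:

```shell
# Write a throwaway XML fragment, then verify it is well formed; substitute
# hdp/solr/hdp1/conf/solrconfig.xml or schema.xml for the real check
cat > /tmp/fragment.xml <<'EOF'
<copyField dest="text" source="id"/>
EOF
xmllint --noout /tmp/fragment.xml && echo "XML OK"
```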

Now we need to create our core/index. Start Solr and point your browser to it (http://sandbox:8983/solr):

    cd hdp
    java -jar start.jar

Click on “Core Admin”, press “Add Core”, and fill in the fields: use hdp1 for both the name and the instanceDir (the remaining fields can keep their defaults).

If everything goes as expected, the new core is created without errors.

If something is broken (XML file not parseable, wrong folder, …) you can easily start fresh:

    # stop or kill Solr
    rm /opt/solr/solr/hdp/solr/hdp1/
    hadoop fs -rm -r /user/solr/hdp1
    # start Solr again

Now choose the just created core “hdp1” from the dropdown box on the left:

Click on Query and press the blue “Execute Query” button. You will see that we still have 0 documents in our index which is no surprise as we have not indexed anything: 
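You can run the same check from the command line. The select URL below follows Solr’s standard query API; the curl call is shown commented out since it needs the running sandbox, so a trimmed sample response is parsed instead:

```shell
# Against the running sandbox you would check the document count with:
#   curl "http://sandbox:8983/solr/hdp1/select?q=*:*&wt=json"
# A trimmed example of the JSON an empty core returns:
RESPONSE='{"responseHeader":{"status":0},"response":{"numFound":0,"docs":[]}}'
# Pull out numFound with sed
NUMFOUND=$(echo "$RESPONSE" | sed 's/.*"numFound":\([0-9]*\).*/\1/')
echo "numFound=$NUMFOUND"
```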

So now we are going to index our big csv file ;-)

    hadoop jar jobjar/hadoop/hadoop-lws-job-1.2.0-0-0.jar \
        com.lucidworks.hadoop.ingest.IngestJob \
        -Dlww.commit.on.close=true \
        -DcsvFieldMapping=0=id,1=text \
        -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
        -c hdp1 \
        -i /user/solr/data/csv/mydata.csv \
        -of \
        -s http://localhost:8983/solr
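A note on the arguments: -DcsvFieldMapping=0=id,1=text tells the CSVIngestMapper that column 0 of each row feeds the id field and column 1 the text field. As a plain-shell illustration of that mapping (not the real ingest job):

```shell
# Column 0 -> id, column 1 -> text, just like csvFieldMapping=0=id,1=text
echo "2,HDP" | awk -F, '{print "id=" $1 " text=" $2}'
# -> id=2 text=HDP
```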

If everything went well your output should look like:

    14/05/24 06:46:00 INFO mapreduce.Job: Job job_1400841048847_0036 completed successfully
    14/05/24 06:46:00 INFO mapreduce.Job: Counters: 32
        File System Counters
            FILE: Number of bytes read=0
            FILE: Number of bytes written=201410
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=287
            HDFS: Number of bytes written=0
            HDFS: Number of read operations=4
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=0
        Job Counters 
            Launched map tasks=2
            Data-local map tasks=2
            Total time spent by all maps in occupied slots (ms)=16727
            Total time spent by all reduces in occupied slots (ms)=0
            Total time spent by all map tasks (ms)=16727
            Total vcore-seconds taken by all map tasks=16727
            Total megabyte-seconds taken by all map tasks=4181750
        Map-Reduce Framework
            Map input records=5
            Map output records=4
            Input split bytes=234
            Spilled Records=0
            Failed Shuffles=0
            Merged Map outputs=0
            GC time elapsed (ms)=146
            CPU time spent (ms)=3300
            Physical memory (bytes) snapshot=295854080
            Virtual memory (bytes) snapshot=1794576384
            Total committed heap usage (bytes)=269484032
        File Input Format Counters 
            Bytes Read=53
        File Output Format Counters 
            Bytes Written=0

Go back to your browser and enter “HDP” in the field called “q” and press “Execute Query”: 


You installed and integrated Solr on HDP, indexed a CSV file through MapReduce, and successfully executed a Solr query against the index!

The next steps are installing Solr in SolrCloud mode on an HDP cluster, indexing real files, and creating a nice web app so that business users can easily search for information stored on Hadoop.
I hope this was useful and you had fun!


Vinayak Agrawal
October 14, 2014 at 1:31 pm

While running this tutorial, I get the following error at the step “So now we are going to index our big csv file” :

    Exception in thread "main" java.lang.UnsupportedClassVersionError: JVMCFRE003 bad major version; class=com/lucidworks/apollojj/common/jackson/ApolloModule, offset=6
        at java.lang.ClassLoader.defineClassImpl(Native Method)
        at java.lang.ClassLoader.defineClass(
        at java.lang.ClassLoader.loadClass(
        at java.lang.ClassLoader.loadClass(
        at java.lang.J9VMInternals.verifyImpl(Native Method)
        at java.lang.J9VMInternals.verify(
        at java.lang.J9VMInternals.initialize(
        at com.lucidworks.hadoop.ingest.AbstractIngestMapper.(
        at com.lucidworks.hadoop.ingest.CSVIngestMapper.(
        at java.lang.J9VMInternals.newInstanceImpl(Native Method)
        at java.lang.Class.newInstance(
        at com.lucidworks.hadoop.ingest.IngestJob.main(
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(
        at java.lang.reflect.Method.invoke(
        at org.apache.hadoop.util.RunJar.main(

My java -version output is:

    java version "1.7.0"
    Java(TM) SE Runtime Environment (build pxa6470sr5-20130619_01(SR5))
    IBM J9 VM (build 2.6, JRE 1.7.0 Linux amd64-64 Compressed References 20130617_152572 (JIT enabled, AOT enabled)
    J9VM – R26_Java726_SR5_20130617_1436_B152572
    JIT – r11.b04_20130528_38954ifx1
    GC – R26_Java726_SR5_20130617_1436_B152572_CMPRSS
    J9CL – 20130617_152572)
    JCL – 20130616_01 based on Oracle 7u25-b12

Blair Krotenko
September 11, 2014 at 10:22 am

Thanks for this tutorial, it was very helpful.

I do want to point out that to index the csv file I needed to change the command from “hadoop jar jobjar/hadoop/hadoop-lws-job-1.2.0-0-0.jar…” to “hadoop jar lucidworks-hadoop-lws-job-1.3.0.jar…” since the Lucidworks version has changed.

Other than that, great tutorial.

August 4, 2014 at 2:04 am

When I use the IK analyzer, the stop dictionary and extension dictionary cannot be used normally. How do I place the dictionaries at the correct path? Thank you!

hung chen
July 28, 2014 at 3:26 pm

Do you have instruction on building SolrCloud for HDP 2.1?

Sagar Prasad
July 28, 2014 at 4:42 am

@navdeep, you can add an entry for port 8983 in network, it worked for me.

Jim Cheung
June 18, 2014 at 12:30 pm


thanks for your clear tutorial
I have trouble indexing the csv file

when i run the command:

hadoop jar jobjar/hadoop/hadoop-lws-job-1.2.0-0-0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -DcsvFieldMapping=0=id,1=text -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c hdp1 -i /user/solr/data/csv/mydata.csv -of -s http://localhost:8983/solr

i got the following exception:

    Exception in thread "main" invalid distance code
        at org.apache.hadoop.util.RunJar.unJar(
        at org.apache.hadoop.util.RunJar.unJar(
        at org.apache.hadoop.util.RunJar.main(

Do you have any suggestions on how to fix it?

Thanks :)

navdeep agrawal
June 16, 2014 at 6:10 am

I am trying to load Solr through Pig Latin using the Lucidworks Hadoop connector but am unable to load it. Can you please share some of the code you described in one of your webinars?

June 4, 2014 at 12:54 am

I am following the above tutorial and am having trouble connecting my browser to Apache Solr (http://sandbox:8983/solr); I am unable to connect. Am I missing something? I am using the default solr.xml.

In the logs: started socketconnector@

