Indexing and Searching Documents with Apache Solr

In this tutorial, we will learn to:

  • Configure Solr to store indexes in HDFS
  • Create a Solr cluster of two Solr instances running on ports 8983 and 8984
  • Index documents in HDFS using the Hadoop connectors
  • Use Solr to search documents

Prerequisite

  • A Hortonworks HDP 2.3 Sandbox VM, downloaded and started (or an Ambari-installed HDP 2.3 cluster)

Step 1 – Log into Sandbox

  • After it boots up, find the IP address of the VM and add an entry to your machine's hosts file, e.g.
192.168.191.241 sandbox.hortonworks.com sandbox    
  • Connect to the VM via SSH (user root, password hadoop), and correct the /etc/hosts entry on the VM if needed
ssh root@sandbox.hortonworks.com
  • If running on an Ambari-installed HDP 2.3 cluster (instead of the sandbox), run the commands below to install HDP Search
yum install -y lucidworks-hdpsearch
sudo -u hdfs hadoop fs -mkdir /user/solr
sudo -u hdfs hadoop fs -chown solr /user/solr
  • If running on the HDP 2.3 sandbox, run the command below instead
chown -R solr:solr /opt/lucidworks-hdpsearch
  • Run the remaining steps as the solr user
su solr

Step 2 – Configure Solr to store index files in HDFS

  • For this lab, we will use the schemaless configuration that ships with Solr
    • Schemaless configuration is a set of Solr features that let you index documents without specifying their schema in advance
    • A sample schemaless configuration can be found in the directory /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs
  • Let’s create a copy of the sample schemaless configuration and modify it to store indexes in HDFS
    cp -R /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs
    
  • Open /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf/solrconfig.xml in your favorite editor and make the following changes:

1- Replace the section:

                <directoryFactory name="DirectoryFactory"
                class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">
                </directoryFactory>

with

            <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
                <str name="solr.hdfs.home">hdfs://sandbox.hortonworks.com/user/solr</str>
                <bool name="solr.hdfs.blockcache.enabled">true</bool>
                <int name="solr.hdfs.blockcache.slab.count">1</int>
                <bool name="solr.hdfs.blockcache.direct.memory.allocation">false</bool>
                <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
                <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
                <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
                <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
                <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
                <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
            </directoryFactory>

2- Set the lockType to

<lockType>hdfs</lockType>

3- Save and exit the file
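
As an optional sanity check (a quick sketch using standard shell tools), confirm that the new configset exists and that solrconfig.xml now references the HDFS directory factory and lock type:

# The copied configset directory should exist
ls /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf

# Both edits should show up in solrconfig.xml
grep -E 'HdfsDirectoryFactory|<lockType>' /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf/solrconfig.xml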

Step 3 – Start 2 Solr instances in solrcloud mode

mkdir -p ~/solr-cores/core1
mkdir -p ~/solr-cores/core2
cp /opt/lucidworks-hdpsearch/solr/server/solr/solr.xml ~/solr-cores/core1
cp /opt/lucidworks-hdpsearch/solr/server/solr/solr.xml ~/solr-cores/core2
/opt/lucidworks-hdpsearch/solr/bin/solr start -cloud -p 8983 -z sandbox.hortonworks.com:2181 -s ~/solr-cores/core1
/opt/lucidworks-hdpsearch/solr/bin/solr start -cloud -p 8984 -z sandbox.hortonworks.com:2181 -s ~/solr-cores/core2
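
If both instances came up, the status command should report two running Solr nodes (a quick check; the exact output varies by Solr version):

# Should list Solr processes on ports 8983 and 8984
/opt/lucidworks-hdpsearch/solr/bin/solr status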

Step 4 – Create a Solr Collection named “labs” with 2 shards and a replication factor of 2

/opt/lucidworks-hdpsearch/solr/bin/solr create -c labs \
    -d /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf \
    -n labs -s 2 -rf 2

Here -c is the collection name, -d the configuration directory, -n the name under which the configuration is stored in ZooKeeper, -s the number of shards, and -rf the replication factor.

Step 5 – Validate that the labs collection got created

(Screenshot: Solr Admin UI showing the labs collection)
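
You can verify this in the Solr Admin UI at http://sandbox.hortonworks.com:8983/solr (the Cloud view should show the labs collection with 2 shards), or from the shell. The commands below are a sketch; healthcheck is a subcommand of the Solr 5 bin/solr script, and the HDFS path follows from the solr.hdfs.home value set in Step 2:

# Check that all shards/replicas of the collection are active
/opt/lucidworks-hdpsearch/solr/bin/solr healthcheck -c labs -z sandbox.hortonworks.com:2181

# Index directories for the collection's cores should appear under /user/solr
hadoop fs -ls /user/solr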

Step 6 – Load documents to HDFS

  • Upload the sample CSV file to HDFS. We will index the file with Solr using the Solr Hadoop connectors
hadoop fs -mkdir -p csv
hadoop fs -put /opt/lucidworks-hdpsearch/solr/example/exampledocs/books.csv csv/
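
To confirm the upload (optional):

# books.csv should be listed in the csv directory of your HDFS home
hadoop fs -ls csv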

Step 7 – Index documents with Solr using Solr Hadoop Connector

hadoop jar /opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar \
    com.lucidworks.hadoop.ingest.IngestJob \
    -DcsvFieldMapping=0=id,1=cat,2=name,3=price,4=instock,5=author \
    -DcsvFirstLineComment \
    -DidField=id \
    -DcsvDelimiter="," \
    -Dlww.commit.on.close=true \
    -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
    -c labs \
    -i csv/* \
    -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
    -zk localhost:2181

Here -DcsvFieldMapping maps CSV columns to Solr fields, -DcsvFirstLineComment skips the header row, -Dlww.commit.on.close=true commits the documents when the job finishes, -cls selects the CSV ingest mapper class, -c is the target collection, -i the input path, -of the output format, and -zk the ZooKeeper connection string.

Step 8 – Search indexed documents

(Screenshot: search results for the labs collection in the Solr Admin UI)
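
You can search from the Query tab of the labs collection in the Solr Admin UI, or over HTTP with the standard select handler. A minimal sketch (the match-all query is safe regardless of schema; the second query assumes the author field was populated from books.csv as mapped in Step 7):

# Return all indexed documents as indented JSON
curl "http://sandbox.hortonworks.com:8983/solr/labs/select?q=*:*&wt=json&indent=true"

# Search a specific field, e.g. all documents that have an author
curl "http://sandbox.hortonworks.com:8983/solr/labs/select?q=author:*&wt=json"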

Summary

In this tutorial we have explored how to:

  • Store Solr indexes in HDFS
  • Create a Solr cluster
  • Index documents in HDFS using Solr Hadoop connectors