newsletter

Get fresh updates from Hortonworks by email

Once a month, receive latest insights, trends, analytics information and knowledge of Big Data.

AVAILABLE NEWSLETTERS:

Sign up for the Developers Newsletter

Once a month, receive latest insights, trends, analytics information and knowledge of Big Data.

cta

Get Started

cloud

Ready to Get Started?

Download sandbox

How can we help you?

* I understand I can unsubscribe at any time. I also acknowledge the additional information found in Hortonworks Privacy Policy.
closeClose button
October 22, 2018
prev slideNext slide

Enterprise Search with HDP Search

We are excited to announce the immediate availability of HDPSearch 4.0. As you are aware, HDP Search offers a performant, scalable, and fault-tolerant enterprise search solution. With HDP Search 4.0, we have added the following new features:

  • Upgraded Apache Solr to Solr 7.4. Apache Solr 7.x offers several new features such as autoscaling, replication modes, etc.
  • Updated Apache Hadoop Solr connectors to support HDP 3.0
    • HDFS (updated to support Hadoop 3.1.0)
    • Apache Hive (updated to support Apache Hive 3.0)
    • Apache Spark (updated to support Apache Spark 2.3.1)

Note: HDP Search 4.0 will only work with HDP 3.0 because the connectors have been updated to support the corresponding component versions in HDP 3.0.

Solr 7.x includes the following news features:

  • Autoscaling – Today, cluster management (provisioning, splitting shards, moving replicas, etc.) is largely manual. With the auto-scaling feature, you can automate collection and data/load redistribution operations via autoscaling APIs. It aims to keep the SolrCloud cluster balanced and stable in the face of various cluster change events. This balance is achieved by satisfying a set of user-configurable rules and sorting preferences to select the target of cluster management operations automatically on cluster events.
  • Replication Modes – Two new types of replicas have been added, named TLOG & PULL to provide flexibility to the existing SolrCloud replica model.

Please refer to the Apache Solr 7.4 reference guide for more information on the new Solr 7.x features.

Below shows how you can use HDP Search 4.0 with the updated spark connector to write and read data from Solr:

Step 1: Writing data to Solr

Below curl command will create a spark collection named “sparkcollection” with 2 shards and 1 replica as seen in the Solr UI below:

curl -X GET  “http://<solr_host>:8983/solr/admin/collections?action=CREATE&name=sparkcollection&numShards=2&replicationFactor=1”

The CSV file used here is located here. Move the CSV file to the /tmp HDFS directory and read it as a Spark DataFrame as shown below.

Index this data to Solr using the command: http://<solr_host>:8983/solr/sparkcollection/update?commit=true

Run a query on Solr UI to validate the setup:

*:* returns all 999 docs indexed

Query for a particular pickup location returns one document

Step 2: Reading data from Solr

Read from Spark (tip and fare):

Read from spark (total amount and toll amount)

Similarly, you can index and search data stored in Hive and HDFS as well by using the corresponding connectors.

Use the below resources to learn more about HDPSearch 4.0:

 

 

Please stay tuned for future blogs on HDP Search!

Leave a Reply

Your email address will not be published. Required fields are marked *