Discover HDP 2.1: Apache Solr for Hadoop Search

Follow-up on the Apache Solr webinar

We recently hosted the fifth of our seven Discover HDP 2.1 webinars, entitled Apache Solr for Hadoop Search. Over 200 people attended, and their questions prompted an informative discussion.

The speakers presented an overview of Apache Solr and its features, followed by a practical demo of how to process, index, search, and visualize server log data.

Thanks to Justin Sears (Hortonworks’ Product Marketing Manager), Rohit Bakhshi (Hortonworks’ Senior Product Manager), and Paul Codding (Hortonworks’ Solution Engineer) for presenting the webinar. The speakers covered:

  • Solr’s Advanced Full-Text Search Capabilities
  • Scalable Indexing of Data in HDFS
  • High Performance Indexing
  • Data Ingestion and Indexing
  • Search and Statistics Visualization

If you missed the webinar, here is the complete recording.

And here is the presentation deck.

Webinar Q & A

Q: What’s the projected time frame to have Solr integrated with Hive? Are there any details on how this integration would look from a Hive user’s point of view?

A: We are looking into getting the Hive integration done in the next few months.

It will look similar to the Pig integration, where you’ll be able to output data from Hive and create a Solr index.

When we have that integrated into HDP, we will have documentation to walk you through how to use it.

Q: Where is the indexed data stored?

A: There are multiple storage options, and you can set them on a per-core basis. Documents in one core can be stored in HDFS, while documents in another core are stored on the local filesystem.

For the demo, we chose to store all indexes on a local filesystem. But we could have easily stored the indexed documents within HDFS.
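As a minimal sketch of that flexibility (assuming two hypothetical cores, logs_hdfs and logs_local, on a local Solr 4.x server, and the Python requests package), the client-side indexing call is identical for both cores; the storage backend is chosen server-side in each core’s solrconfig.xml:

```python
import requests  # assumes the 'requests' package is installed

SOLR = "http://localhost:8983/solr"
docs = [{"id": "log-1", "request": "GET /index.html", "status": 200}]

# The update call is identical for both cores; whether an index lives in
# HDFS or on local disk is configured server-side, per core, in that core's
# solrconfig.xml (e.g. directoryFactory class="solr.HdfsDirectoryFactory").
for core in ("logs_hdfs", "logs_local"):  # hypothetical core names
    resp = requests.post(f"{SOLR}/{core}/update?commit=true",
                         json=docs, timeout=30)
    resp.raise_for_status()
```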

Q: Why would I use multi-node Solr? If the data is being stored in HDFS, the indexes are stored in HDFS as well. Why would I look at SolrCloud?

A: Search is trying to solve multiple problems.

We can store Solr’s underlying Lucene indexes in a distributed fashion on HDFS.

But one of the big benefits of SolrCloud is the ability to distribute not only the stored data but also the indexing and search processes.

Just as Apache Hadoop pairs a distributed file system (HDFS) with distributed compute, SolrCloud offers both distributed indexing and distributed search.

As our search needs grow and expand, with more users and more applications hitting these indexes, SolrCloud lets us scale that out.
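To illustrate that scale-out model, here is a minimal sketch (assuming a running SolrCloud cluster, the Python requests package, and a hypothetical configset named logs_conf already uploaded to ZooKeeper) that uses the Collections API to create a collection whose indexing and query load are spread across shards and replicas:

```python
import requests  # assumes the 'requests' package is installed

# Spread one collection over 2 shards with 2 replicas each, so indexing is
# distributed by shard and query load is distributed across replicas.
params = {
    "action": "CREATE",
    "name": "weblogs",                      # hypothetical collection name
    "numShards": 2,
    "replicationFactor": 2,
    "maxShardsPerNode": 4,                  # allow a small test cluster
    "collection.configName": "logs_conf",   # configset already in ZooKeeper
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/admin/collections",
                    params=params, timeout=60)
print(resp.json())
```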

Q: For the demo, how much data was pre-indexed and how much was done on demand?

A: In this case, I had a series of files for one of the blogs that I help with, about 500 MB, comprising the last three years of access logs.

I did a one-time index. I ran the Pig script once, which took about three minutes to index all of that data.

Once it was indexed, I could pivot and use the Banana tool to visualize that data.
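Dashboards like Banana are built from ordinary Solr queries. As a hedged illustration (assuming a hypothetical weblogs core with a date field named timestamp; neither name comes from the demo), a time-histogram panel boils down to a range-facet query like this sketch:

```python
import requests  # assumes the 'requests' package is installed

# One day of traffic bucketed by hour; rows=0 skips the documents themselves
# and returns only facet counts, which is all a time-series panel needs.
params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.range": "timestamp",                   # hypothetical date field
    "facet.range.start": "2014-07-01T00:00:00Z",
    "facet.range.end": "2014-07-02T00:00:00Z",
    "facet.range.gap": "+1HOUR",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/weblogs/select",
                    params=params, timeout=30)
counts = resp.json()["facet_counts"]["facet_ranges"]["timestamp"]["counts"]
print(counts)  # alternating bucket start / count, ready to plot
```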

Q: How do you ingest data into Solr? Is that done in real time or in batch?

A: There are many ways to ingest data, depending on your use case.

For the type of log analysis that we showed in the demo, it would be more of a batch-oriented operation. I might have Flume on the edge of my network pulling in the access logs, and then use Oozie to run batch indexing jobs at scheduled intervals.

It all depends on how quickly you want the data to be searchable; that determines how you manage the gap between ingest and searchability.

Also, it depends on whether you want to index on a per-event basis with a streaming solution like Apache Storm or do it in batch (maybe hourly) with Apache Pig.
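To sketch those two extremes in code (assuming a hypothetical weblogs core and the Python requests package; neither was part of the demo), per-event posts make data searchable almost immediately, while batching trades freshness for throughput:

```python
import requests  # assumes the 'requests' package is installed

UPDATE = "http://localhost:8983/solr/weblogs/update"  # hypothetical core

def index_per_event(event):
    # Streaming style (roughly what a Storm bolt would do per tuple): one
    # request per event; commitWithin asks Solr to make it visible within ~1s.
    requests.post(UPDATE, params={"commitWithin": 1000},
                  json=[event], timeout=10).raise_for_status()

def index_batch(events, batch_size=1000):
    # Batch style (what a scheduled Pig/Oozie job approximates): fewer,
    # larger requests, then a single explicit commit at the end.
    for i in range(0, len(events), batch_size):
        requests.post(UPDATE, json=events[i:i + batch_size],
                      timeout=60).raise_for_status()
    requests.get(UPDATE, params={"commit": "true"}, timeout=60)
```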

Q: What version of Solr does Hortonworks support? How do I evaluate and learn more about HDP Search?

A: We support Apache Solr 4.7.2. To evaluate it, we encourage you to download the HDP 2.1 Sandbox. There’s a “Searching Data with Apache Solr” tutorial that lets you run through an example and get a hands-on feel for the technology.

Q: What’s the name of the tool you used for visualization?

A: We used the Banana component of LucidWorks SiLK.

Here is the link to Banana on GitHub.

Q: For developers, if you store indexes on HDFS, do you have to use different APIs to access indexes? Or does Solr provide an abstraction so that only one set of APIs is used, and then the configuration determines where to fetch the indexes?

A: Apache Solr provides one set of query access APIs that users and applications use to submit queries.

Indexes can be stored on HDFS to provide a scalable, fault-tolerant store for Apache Solr. The storage location of the indexes does not affect the API used to submit search queries to Apache Solr.
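As a minimal illustration of that abstraction (again assuming a hypothetical weblogs core), the same /select call works regardless of the storage backend:

```python
import requests  # assumes the 'requests' package is installed

# The /select API is the same whether this core's index sits in HDFS or on
# local disk; storage is purely a server-side configuration detail.
resp = requests.get("http://localhost:8983/solr/weblogs/select",
                    params={"q": "status:404", "rows": 10, "wt": "json"},
                    timeout=30)
for doc in resp.json()["response"]["docs"]:
    print(doc["id"])
```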
