Discover HDP 2.1: Apache Solr for Hadoop Search
We recently hosted the fifth of our seven Discover HDP 2.1 webinars, entitled Apache Solr for Hadoop Search. More than 200 people attended, and the session prompted an informative discussion.
The speakers gave an overview of Apache Solr and its features, followed by a practical demo of how to process, index, search, and visualize server log data.
Thanks to our presenters: Justin Sears (Hortonworks’ Product Marketing Manager), Rohit Bakhshi (Hortonworks’ Senior Product Manager), and Paul Codding (Hortonworks’ Solution Engineer). The speakers covered:
- Solr’s Advanced Full-Text Search Capabilities
- Scalable Indexing of Data in HDFS
- High Performance Indexing
- Data Ingestion and Indexing
- Search and Statistics Visualization
If you missed the webinar, here is the complete recording.
And here is the presentation deck.
Webinar Q & A
|What’s the projected time frame to have Solr integrated with Hive? Are there any details on how this integration would look from a Hive user’s point of view?||
We are looking into getting the Hive integration done in the next few months.
It will look similar to the Pig integration, where you’ll be able to write output from Hive to create a Solr index.
When we have that integrated into HDP, we will have documentation to walk you through how to use it.
|Where is the indexed data stored?||
There are multiple storage options, and you can set them on a per-core basis. You can store the documents for one core in HDFS and the documents for another core on the local filesystem.
For the demo, we chose to store all indexes on a local filesystem. But we could have easily stored the indexed documents within HDFS.
|Why would I use multi-node Solr? If the data is being stored in HDFS, the indexes are stored in HDFS as well. Why would I look at a Solr cloud?||
Search is trying to solve multiple problems.
We can store the Lucene indexes in Solr in a distributed fashion with HDFS.
But one of the big benefits of a Solr cloud is the ability to distribute not only the stored data but also the indexing process and the search process.
Just as Apache Hadoop combines a distributed file system with distributed compute, Solr cloud offers both distributed indexing and distributed search.
As our search needs grow, with more users and more applications hitting these indexes, Solr cloud lets us scale that out.
|For the demo, how much data was pre-indexed and how much was done on demand?||
In this case, I had a series of files for one of the blogs that I help with: about 500 MB of access logs covering the last three years.
I did a one-time index. I ran the Pig script once, which took about three minutes to index all of that data.
Once it was indexed, I could pivot and use the Banana tool to visualize that data.
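To give a feel for the "process" step, here is a minimal Python sketch of turning one access-log line into a document with named fields. It is only an illustration: the demo itself used a Pig script, the post does not give the log layout, and the sketch assumes the common Apache "combined" format. The field names (`ip`, `timestamp`, `path`, and so on) are hypothetical, not the demo's schema.

```python
import re
from datetime import datetime, timezone

# Assumption: Apache "combined" log format. The demo's real log layout
# and Solr schema are not described in the post.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    # Convert the log timestamp to UTC in ISO 8601 form, which is the
    # date format Solr date fields expect.
    parsed = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    timestamp = parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    size = match.group("bytes")
    return {
        "ip": match.group("ip"),
        "timestamp": timestamp,
        "method": match.group("method"),
        "path": match.group("path"),
        "status": int(match.group("status")),
        "bytes": 0 if size == "-" else int(size),
    }


if __name__ == "__main__":
    sample = ('203.0.113.9 - - [10/Jun/2014:13:55:36 -0700] '
              '"GET /blog/solr HTTP/1.1" 200 2326 "-" "Mozilla/5.0"')
    print(parse_line(sample))
```

Documents shaped like this are what a visualization tool can then facet on, for example by status code or by time.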
|How do you ingest data into Solr? Is that done in real-time or in batch?||
There are many ways to ingest data (depending on your use case).
For the type of log analysis that we showed in the demo, it would be more of a batch-oriented operation. I might have Flume on the edge of my network pulling the access logs in, and then schedule Oozie to do batch indexes at pre-scheduled intervals.
It all depends on how quickly you want the data to be searchable. That determines how you manage the ingest gaps.
Also, it depends on whether you want to index on a per-event basis with a streaming solution like Apache Storm or do it in batch (maybe hourly) with Apache Pig.
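The batch versus per-event trade-off can be sketched in a few lines of Python. This is a hedged illustration rather than how Storm or Pig do it: the helper names `batches` and `to_update_payload` are made up for this example, and the only Solr behavior assumed is that its JSON update handler accepts a JSON array of documents. A batch size of 1 approximates per-event indexing, while a larger size approximates a scheduled batch run.

```python
import json


def batches(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def to_update_payload(batch):
    """Serialize one batch as a JSON array of documents."""
    return json.dumps(batch)


if __name__ == "__main__":
    # Hypothetical documents; real ones would come from parsed log events.
    events = [{"id": str(i), "status": 200} for i in range(5)]
    for batch in batches(events, size=2):
        print(to_update_payload(batch))
```

Smaller batches make data searchable sooner at the cost of more update requests; larger batches do the opposite.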
|What version of Solr does Hortonworks support? How do I evaluate and learn more about HDP Search?||
We encourage you to download the HDP 2.1 Sandbox.
We support Apache Solr 4.7.2. And there’s a “Searching Data with Apache Solr” tutorial that lets you run through an example and get a hands-on feel for the technology.
|What’s the name of the tool you used for visualization?||
We used the Banana component of LucidWorks SiLK.
Here is the link to Banana on GitHub.
|For developers, if you store indexes on HDFS, do you have to use different APIs to access indexes? Or does Solr provide an abstraction so that only one set of APIs is used, and then the configuration determines where to fetch the indexes?||
Apache Solr provides one set of query access APIs that users and applications use to submit queries.
Indexes can be stored on HDFS to provide a scalable, fault-tolerant store for Apache Solr. The storage location of the indexes does not affect the API used to submit search queries to Apache Solr.
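As a small illustration of that point, the Python sketch below builds a query URL for Solr's standard `/select` handler. Nothing in it refers to where the index lives, so the same request would apply whether the core's index is on HDFS or on a local disk. The host, port, and core name (`access_logs`) are assumptions for the example, not values from the webinar.

```python
from urllib.parse import urlencode


def select_url(base_url, core, query, rows=10):
    """Build a URL for a core's /select handler, asking for JSON results."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


if __name__ == "__main__":
    # Hypothetical host and core name.
    print(select_url("http://localhost:8983/solr", "access_logs", "status:404", rows=5))
```

Switching a core between storage options is a server-side configuration change; a client built this way would not need to change.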
- Visit our What’s New in HDP 2.1 page and the Apache Solr page to learn more.
- Attend our next Discover HDP 2.1 webinar on Thursday, June 19 at 10 am Pacific Time: Apache Storm for Stream Data Processing
- And if you have any further questions about Apache Solr (documentation, code examples, tutorials), please post them on the community forums under Solr.