Semantic search faceting
In which I learn how to use Solr facets to evaluate suggested searches before presenting them to users
Evaluating suggested queries
I am in my first weeks at my new job at Cornell. I am working on a product in which a list of suggested searches appears. Only some of these suggested searches would produce any actual results if chosen by a user. Cornell does not want a user to attempt a suggested query only to see no results. So, I need a way to remove zero-result searches from the list.
Furthermore, I may also want to deprioritize search suggestions with very few results.
To accomplish both these things, I need a count of how many results each search suggestion will return.
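Once those counts exist, the filtering itself is small. Here is a minimal Python sketch of the idea, assuming the counts have already been collected into a dict. The function name, the min_results threshold, and the sample terms "dancer" and "elf" are hypothetical, made up for illustration.

```python
def rank_suggestions(counts, min_results=2):
    """Drop zero-result suggestions and list sparse ones last.

    counts maps each suggested search to its result count.
    min_results is an assumed cutoff for "very few results".
    """
    kept = [(term, n) for term, n in counts.items() if n > 0]
    # False sorts before True, so well-populated terms come first.
    # list.sort is stable, so the original order survives within each group.
    kept.sort(key=lambda item: item[1] < min_results)
    return [term for term, _ in kept]


# Illustrative counts only; "dancer" and "elf" are invented examples.
counts = {"actor": 17, "elf": 1, "bard": 3, "dancer": 0, "caroler": 5}
print(rank_suggestions(counts))
```

Here "dancer" is removed because it has no results, and "elf" moves to the end of the list because it falls under the threshold.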
How to count results
Getting one count is easy enough:
?q=actor&rows=0
The q param is the main query, the search suggestion I'm evaluating. In this example it is a simple search for the string "actor". Setting the rows param to zero suppresses the unneeded document output, since all we need is a count. The count is given in the numFound portion of the output:
"response":{ "numFound":17, "start":0, "docs":[] }
This is useful, but I'll need to re-run it for every suggested search term. If the list is long, I would be making a lot of query requests. This would be unacceptably slow as a part of a UI.
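The single-count request above could be sketched in Python roughly like this. The Solr URL and core name are placeholders I made up, and the sketch assumes Solr is answering in JSON. It only builds the URL and parses a sample response, so it runs without a Solr server.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; host, port, and core name are assumptions.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def count_url(term):
    """Build a URL that asks Solr for a count only (rows=0)."""
    return SOLR_SELECT + "?" + urlencode({"q": term, "rows": 0})


def parse_count(body):
    """Extract numFound from a Solr JSON response body."""
    return json.loads(body)["response"]["numFound"]


# Sample body copied from the output shown above.
sample = '{"response":{"numFound":17,"start":0,"docs":[]}}'
print(count_url("actor"))
print(parse_count(sample))
```

Calling this once per suggestion is exactly the one-request-per-term pattern that gets slow when the list is long.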
Combining queries with facets
It would likely be faster to present multiple queries to Solr, all in the same HTTP request. At the very least, network connection time would be saved. And maybe we can reduce query execution time?
There is no way (that I'm aware of) to run several main queries in the same request. But all I want from each query is a count. And there is a way to get multiple counts: facets.
?q=*:*&facet=on&facet.query=actor&facet.query=bard&facet.query=caroler&rows=0
This query asks for three counts: the number of records (or documents) that contain the string "actor", the number with "bard", and the number with "caroler". The (truncated) output looks like this:
"response":{ "numFound":232, "start":0, "docs":[] }, "facet_counts":{ "facet_queries":{ "actor":17, "bard":3, "caroler":5 } }
Now the output contains an integer count for each search term. Why is the numFound higher than any of the individual counts, and even greater than the total of all the counts?
Because the query contains q=*:*, which queries for all values in all fields. This query finds all documents available to Solr. The numFound reflects this. The facet counts represent subsets of this group of all documents.
Image of eclipse by Luc Viatour, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1107408
Thanks to Shalin Shekhar Mangar for introducing me to this aspect of Solr facets.