Solr Document Matching: Precision Hits or Casting a Wide Net?

Overview

Computers are precision instruments. People are... not. Setting up a search index isn't just about indexing data into Solr and tying a search box to it. Generally speaking, getting the right results means being able to find a document based on what a person searches for and then prioritizing the document they are most likely to want. This is known as matching (or getting hits) and relevancy, respectively. The goal of this article is to survey some tips and tricks for indexing data in ways that make it easier to establish matches. We'll be working from simpler concepts like data quality down into more technically intensive topics such as vector fields. Relevancy, while extremely important, is a topic for another day.

Can You Have Too Much of a Good Thing?

Matching documents to queries is about finding the balance between limiting the results to only what you want, and allowing more results than you want in order to ensure the desired result comes back. This evaluation is known as precision vs. recall. On one end of the spectrum, you can set up a query and Solr configuration to return the exact document you want back by searching against a unique Id field. This is very precise, but you have to know precisely what you're looking for. The other end of the spectrum is a query that returns every document in Solr. This ensures the document being sought is in the result set, but does nothing to help you pick out the proverbial needle.
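
To make the two extremes concrete, they look roughly like this as Solr query parameters (the id value is made up for illustration):

	  q=id:SP-12345
	  q=*:*

The first returns at most one document but requires already knowing the unique id; the second guarantees the desired document is somewhere in the results but leaves all of the needle-finding to the user.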

Absolute precision and total recall are legitimate use cases, but most searching happens somewhere in between. When configuring Solr, you will want to ask if and when it's ok to not return a document, or to return unrelated documents. For instance, let's assume there are three documents, two tied to "John Smith" and one to "Jon Smith", and I expect someone to search for "John Smith". Do I want to return "Jon Smith", knowing it could in some cases be unrelated and in other cases be the desired result for a person who doesn't know the correct spelling? Or do I want it excluded from the "John Smith" search? In one case, the search is more precise but risks not returning the desired document. In the other, the risk is an extra document coming back that may be irrelevant to the person doing the search because they are sure of the spelling. The answers to these kinds of questions will guide the decisions you make in configuring how to match documents.

Data Quality is Important!

So, how does Solr deal with the gray area between precision and recall? In fact, Solr provides numerous approaches for loosening up the precision of the matching process. However, one of the first things to consider is the quality of the data going into Solr. If the data is bad, incomplete, or just not very helpful, it may be impossible to properly match it to a query. You may find that some data curation goes a long way. For instance, you may want "grass seed" to appear when someone searches for "lawn care", but your "grass seed" products have neither "lawn" nor "care" in any descriptive fields. Although Solr synonyms can be used to assist in this kind of match (for instance, tell it "lawn" can be interpreted as "grass"), simply adding "lawn care" to a searchable field may be the simplest approach with the least unintended consequences. Of course, the downside to curating data is that it can be time consuming, it requires a deeper understanding of the data, and documents will need to be reindexed as the curation changes.
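
As a sketch of what that curation could look like in the schema (the field name here is hypothetical), a dedicated multi-valued keyword field keeps curated terms like "lawn care" separate from the product's original descriptive data:

	  <!-- hypothetical curation field; "lawn care" would be added here for grass seed products -->
	  <field name="curated_keywords" type="text_general" indexed="true" stored="true" multiValued="true"/>

Including this field in the set of searched fields then lets "lawn care" match those products without altering the source data.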

Synonyms!

I bet you knew about synonyms before you even got to this post. I'm here to tell you, yeah, they can be valuable. However, because they operate against every value in a field, they need to be considered both in how Solr uses them to land desired hits and how they may result in undesired matches. It may seem like a great idea to put "green, turquoise" in the synonyms file, but then Solr can't differentiate between the two values at all. So a user may be trying to specifically search for "turquoise" but end up getting a list of "green" items. Worse, what if it's a first_name field and you use it to match "John" and "Jon"? This is a disadvantage for people who know the exact spelling for the person they are looking up, even if it helps someone who searches for "John" but should be searching for "Jon". Is that a trade you want to make? Usage of and entries in synonym files should be carefully considered and well tested against various use cases.
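
For reference, a Solr synonyms file supports both two-way expansions and one-way mappings; a couple of entries based on the examples above might look like this:

	  # two-way expansion: a query for either term will match documents containing either one
	  green, turquoise
	  # one-way mapping: a query containing "lawn" is expanded to also search for "grass"
	  lawn => lawn, grass

The expansion form is the one that carries the green/turquoise risk described above, since Solr stops distinguishing the two terms at query time.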

Hitting the Motherlode with Schema Configuration

The Solr schema defines the fields and how they process data at query and index times. Query time is when a search is run, whereas index time is when the data is added to Solr. If you run through the Five Minutes to Searching tutorial, you will be able to see the default schema at localhost:8983/solr/#/techproducts/files?file=managed-schema.xml. Documentation on the schema file can be found in the Solr Reference Guide. Field and dynamicField elements establish which fields can be searched on or are simply stored. Fields are bound to field types through their "type" attribute, which shares a value with the "name" attribute on the corresponding "fieldType" element.
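
For instance, a field bound to the text_general type (examined in full below) looks something like this; the field name here is illustrative:

	  <!-- type="text_general" on the field matches name="text_general" on the fieldType -->
	  <field name="title" type="text_general" indexed="true" stored="true"/>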

Field analysis converts indexed data and user queries into terms, and then evaluates the terms generated by the user query and the loaded data for matches. If your eyes glazed over while reading that sentence, it's totally understandable. There's a lot to unpack there. First, Solr doesn't compare "words", it compares "terms". A term might exactly match the word being processed, which is the case for any field where type="string", or it might be something as simple as the word converted to lowercase to allow for case-insensitive comparisons. An analyzer might also split a chunk of data into separate tokens based on dashes, then strip out all characters that aren't alpha-numeric, then convert letters to lower case, then further split letter and number groups into separate terms.

It's possible you might actually want it to do all that if you have a part id that looks like "AC143-313bdx1%501" and want to be able to match on "ac143", "1501", "bdx" or "313". Unfortunately, if a field is configured this way, it will also match on any part with the id "135-ac-2" or "523-ac-1". I would suggest keeping terms as long as possible while still managing user expectations. The problem with the example I provided is the risk of creating an "ac" term, which creates way too many false positive matches. This is a case where you need to know your data and your use cases to understand not only when a query will fail to match, but also when it will match too many documents.
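
Here is a sketch of one way that chain could be configured for the part-id example; the type name and the exact tokenizer and filter settings are illustrative, not a recommendation, and would need to be tested against real part numbers:

	  <fieldType name="part_id_loose" class="solr.TextField">
	    <analyzer type="index">
	      <!-- split "AC143-313bdx1%501" on dashes: "AC143", "313bdx1%501" -->
	      <tokenizer name="pattern" pattern="-"/>
	      <!-- strip characters that aren't letters or digits: "313bdx1%501" becomes "313bdx1501" -->
	      <filter name="patternReplace" pattern="[^A-Za-z0-9]" replacement="" replace="all"/>
	      <filter name="lowercase"/>
	      <!-- split letter and number groups into their own terms while keeping the original:
	           "ac143" -> "ac143", "ac", "143"; "313bdx1501" -> "313", "bdx", "1501" -->
	      <filter name="wordDelimiterGraph" generateWordParts="1" generateNumberParts="1"
	              splitOnNumerics="1" preserveOriginal="1"/>
	      <!-- graph-producing filters should be flattened at index time -->
	      <filter name="flattenGraph"/>
	    </analyzer>
	    <analyzer type="query">
	      <tokenizer name="pattern" pattern="-"/>
	      <filter name="patternReplace" pattern="[^A-Za-z0-9]" replacement="" replace="all"/>
	      <filter name="lowercase"/>
	      <filter name="wordDelimiterGraph" generateWordParts="1" generateNumberParts="1"
	              splitOnNumerics="1" preserveOriginal="1"/>
	    </analyzer>
	  </fieldType>

Note that this configuration still produces the short "ac" term called out above, which is exactly the kind of trade-off that has to be weighed against real data.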

Let's examine the text_general type from the default configuration:


	  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
	    <analyzer type="index">
	      <tokenizer name="standard"/>
	      <filter words="stopwords.txt" ignoreCase="true" name="stop"/>
	      <filter name="lowercase"/>
	    </analyzer>
	    <analyzer type="query">
	      <tokenizer name="standard"/>
	      <filter words="stopwords.txt" ignoreCase="true" name="stop"/>
	      <filter synonyms="synonyms.txt" ignoreCase="true" expand="true" name="synonymGraph"/>
	      <filter name="lowercase"/>
	    </analyzer>
	  </fieldType>

As you can see, there are two analyzers. The first, "index", controls how the field value is processed as Solr is populated, and the second, "query", controls how the query is processed before it is compared to the terms stored in a document's field. The analyzer chains are similar, which should be expected. Both of them use the Standard Tokenizer, which splits the value on spaces, tabs, and most non-alpha-numeric characters. Then they compare each term to the values listed in the stopwords.txt file, and finally both of them set all letters to lowercase to permit case-insensitive searching. Imagine if, for instance, the query analyzer forced values to lower case but the index analyzer did not. That could very easily make it impossible to match on the field if all indexed values were, in fact, in upper case. It makes sense for both analyzers to be similar, or even identical, which is why an analyzer can be specified without a "type" attribute, making it the analyzer for both index and query time processing.
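
For instance, a trimmed-down type with a single analyzer that applies at both index and query time would look something like this (an illustration, not one of the default types):

	  <fieldType name="text_simple" class="solr.TextField">
	    <!-- no type attribute: this analyzer is used for both indexing and querying -->
	    <analyzer>
	      <tokenizer name="standard"/>
	      <filter name="lowercase"/>
	    </analyzer>
	  </fieldType>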

I'm sure you spotted the synonyms filter in the query analyzer. Synonym files are bound to specific field types (and through them to specific fields), which does make them a bit more useful since you don't need to worry about cross-contamination between, say, a "title" field and a "brand" field. It's usually better practice to apply synonyms at query time and keep the indexed data in its original form: it preserves data integrity, and changes to a synonym file bound to the index analyzer require a full reindex of the data, whereas changes to a query analyzer's synonym file take effect as soon as the core or collection is reloaded, with no reindexing required.

I advise caution when working directly with field analysis, but not so much caution that you avoid it as a tool altogether. It's powerful and sometimes necessary. Assuming you went through the tutorial, there is a useful tool for testing field analyzers built into the Solr web interface at localhost:8983/solr/#/techproducts/analysis. To alleviate the risk of overanalyzing a field, you might consider using copyFields.

Also, it's useful to remember the goal is not to match a specific field, the goal is to match the document. It's easy to fall into a trap of trying to make one field handle every scenario. To that end, it's good practice to maintain the original field values in stored fields and use copyFields to replicate the original values into one or more searchable fields. This permits the values to be represented in more than one way depending on which fields are being searched, and all of those different representations can be searched together, either by listing the fields in the eDismax "qf" parameter or by referencing them explicitly in a Standard Query Parser query. For instance, in the "AC143-313bdx1%501" example above, I could have one searchable field that only preserves the text before the first dash and sets it to lower case, another field that strips out the "%" symbol, separates out numeric groups and stores them as terms, and a third field that only preserves letter groups that have a length of 3 or more. There are still possibilities for false positive hits on the document, but far fewer, while maintaining a high likelihood of matching on all part numbers that share a similar format.
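
A sketch of that layout for the part-number case might look like the following; every field and type name here is illustrative, and each of the three variant types would carry its own analyzer chain:

	  <!-- the stored original, untouched -->
	  <field name="part_number" type="string" indexed="true" stored="true"/>
	  <!-- searchable variants, each analyzed differently -->
	  <field name="part_number_prefix" type="part_id_prefix" indexed="true" stored="false"/>
	  <field name="part_number_digits" type="part_id_digits" indexed="true" stored="false"/>
	  <field name="part_number_letters" type="part_id_letters" indexed="true" stored="false"/>
	  <copyField source="part_number" dest="part_number_prefix"/>
	  <copyField source="part_number" dest="part_number_digits"/>
	  <copyField source="part_number" dest="part_number_letters"/>

With eDismax, searching all of the variants is then just a matter of listing them, e.g. qf="part_number part_number_prefix part_number_digits part_number_letters".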

Machine Learning? AI? Sure!

Solr supports LLM and AI technologies through what is known as vector search, using Dense Vector Fields and k-nearest-neighbor (KNN) queries. At the risk of underselling the complexity of deriving vectors from models such as SBERT or Llama, Solr allows you to add the vector embeddings in the same way it does any other document data field. At query time, Solr compares a query vector against the indexed document vectors using a configured similarity function (cosine similarity, for example), which enables efficient ranking of documents by how close they are to the query vector. To support this, you will need to configure one or more dense vector fields whose vectorDimension matches the size of the embeddings generated by the model, and use the rest of the field type settings to control how vectors are compared.
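
A minimal sketch, assuming a model that produces 768-dimensional embeddings (the type and field names here are illustrative):

	  <fieldType name="knn_vector_768" class="solr.DenseVectorField"
	             vectorDimension="768" similarityFunction="cosine"/>
	  <field name="content_vector" type="knn_vector_768" indexed="true" stored="true"/>

At query time, the embedding generated for the user's query is handed to the KNN query parser, along the lines of q={!knn f=content_vector topK=10}[0.012, -0.34, ...], where the bracketed list is the full query vector.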

As you can see, Solr has the capability and flexibility to support both traditional keyword search (aka sparse vectors) and vector search (aka dense vectors) as just described. Another approach, called hybrid search, combines the strengths of the two to get the best of both worlds. Please stay tuned for another article that will discuss this topic in depth.

Final Thoughts

A lot of ground was covered here on what Solr is capable of, but many of the choices in this kind of work depend on your use cases. Understanding the differences between an indexed document's values and how a person might search for it is critical, because people do not always know exactly what they are looking for, nor do they know how it is represented in the data. On top of that, misspelled words and typos are common. If your search configuration doesn't attempt to account for these factors and more, it can lead to an unproductive and frustrating experience. Curating your data, fiddling with field analysis, and incorporating vector search will never solve 100% of your use cases perfectly, and sometimes you'll need to determine when it's ok for unwanted documents to show up in your results. That's a situation that can often be mitigated with relevancy tuning, so you may find it's less important to limit your unwanted matches than it is to ensure the important matches are made.