Beyond Keywords: Mongo’s Guide to Effective Hybrid Search

Mongo Hybrid Search

The crux of hybrid search lies in how you combine vector search results with keyword (aka lexical) search results to maximize the benefits of each. Each of these searches ranks results, which means they provide some kind of information on how relevant each result is to the query provided. Unfortunately, the relevancy schemes operate in incompatible ways. In this blog, I'll be breaking down how Mongo addresses this fundamental disparity in their How to Perform Hybrid Search tutorial. First, let's take a quick look at the advantages of each of these search techniques and how they rank their results.


Vector search has come to the forefront of applied data science as a result of the rise of Large Language Models (LLMs). One of the more intriguing things LLMs provide, from a search perspective, is the ability to account for context. So, if I'm searching for "Tee Shirt", a vector search might consider it to be like any kind of summer clothing, such as a blouse or cap. Or, it might see it as similar to an article of informal clothing such as shorts. Depending on how the model is trained, it might even contextualize it as something commonly sold as merch at an event such as a concert, like a sticker. The advantage of a vector search is that the person doing the search doesn't need to know the "right" word to find what they are looking for. However, this can also lead to results that are completely outside normal expectations. Ultimately, these models convert documents and queries into vectors (more easily thought of as points in space). For our purposes, relevancy is generally calculated as the distance between points (Euclidean) or the angle between the vectors as measured from the origin (cosine).
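To make the two relevancy measures concrete, here is a minimal Python sketch. The three-dimensional vectors are made-up values for illustration only; real embedding models produce vectors with hundreds or thousands of dimensions, and the numbers below do not come from any actual model.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
embeddings = {
    "tee shirt": [0.9, 0.8, 0.1],
    "blouse": [0.8, 0.9, 0.2],
    "deck chair": [0.1, 0.2, 0.9],
}

query = embeddings["tee shirt"]
for name, vector in embeddings.items():
    distance = euclidean_distance(query, vector)
    similarity = cosine_similarity(query, vector)
    print(f"{name:12} distance={distance:.3f} cosine={similarity:.3f}")
```

With these invented values, "blouse" lands much closer to "tee shirt" than "deck chair" does under both measures, even though the two names share no words.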


This is completely unlike lexical search, which is based on term and document frequency or the more powerful Okapi BM25 ranking function. Lexical searches require the query and documents to have terms that specifically match. If I search for "Tee Shirt" in a lexical index, at least one of those terms must appear somewhere in the document. Some configuration work can provide a certain level of flexibility in how queries and fields are processed into terms. This may allow partial matches, like returning results that contain just "Tee" or "Shirt". It might also include manually configured synonyms, so "blouse" might be treated the same as "shirt", but if you don't include a synonym, the search can't make the inference on its own. Configuration might also search across fields and return results that aren't immediately obvious. These results usually come back in what is commonly known as the long tail. So you might search for "Tee Shirt" and get back a deck chair because a description field includes something like "hanging out on your deck in summer drinking a lemonade in a light shirt". Relevancy is based on formulae that take into account factors ranging from how common a term is across all documents, to how many other terms there are in a field that includes the query term, to configured field boosting, and a raft of other factors.
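As a rough illustration of how term rarity and field length feed into a lexical score, here is a simplified BM25 sketch in Python. It assumes whitespace tokenization, a Lucene-style IDF, and the commonly used defaults k1 = 1.2 and b = 0.75; the documents are invented, and a production search engine layers analyzers, field boosts, and other factors on top of this.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a simplified BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        freqs = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenized if term in other)
            # Rarer terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = freqs[term]
            # Longer-than-average documents are normalized downward.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Invented product descriptions for illustration.
docs = [
    "classic cotton tee shirt",
    "golf tee pack",
    "silk dress shirt",
    "deck chair for relaxing in a light shirt",
    "leather belt",
]
scores = bm25_scores("tee shirt", docs)
for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```

In this toy corpus "tee" appears in fewer documents than "shirt", so the golf tee outranks the dress shirt, and the long deck chair description trails in the long tail with a small but non-zero score.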


Hybrid search seeks to overcome the fundamental disparity in relevancy calculations to create a cohesive result set from vector and lexical searches based on the same (or similar) queries. While there is no "correct" solution in this space as yet, there are several approaches. MongoDB provides one suggested approach in their tutorial How to Perform Hybrid Search, which we can use to illustrate how to address some of the obstacles in merging vector and lexical search results. In their tutorial, they set up an example to return the top 10 results. For simplicity, the examples here will work toward acquiring the top 5. The results will be for a theoretical search for "tee shirt".


The first major question you will be confronted with when developing a hybrid search is, "What can serve as the basis of comparison?" Instead of basing it on the scores provided by each search, which are incompatible with each other, Mongo develops a separate scoring process based solely on the respective positions of documents in each set of results. They start by querying by vector and by term (lexical) and storing the results by rank. This brings us to an important point: for this solution to work, the queries need to return more results than the desired output. I will explain why as we step through the solution. Mongo's example doubles the desired result set count, and there are advantages to higher over-counts (and some disadvantages). For simplicity, we'll go with 7.

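| Rank | Vector Results | Lexical Results |
|------|----------------|-----------------|
| 1 | Tee Shirt | Tee Shirt |
| 2 | Jersey | Golf Tee |
| 3 | Pants | Blouse |
| 4 | Blouse | Dress Shirt |
| 5 | Belt | Casual Shirt |
| 6 | Cap | Deck Chair |
| 7 | Sticker | Cotton Shirt |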

At this point, you could choose to shuffle the results together, but you run into a couple of issues. First, you would be assuming the different search techniques give results of equal quality. This rarely being the case, Mongo suggests applying "penalties" to the different results. For instance, if I thought the lexical results were generally a little more useful, I might apply a penalty of 2 to the vector results, which pushes each vector result down two positions and produces the following Adjusted Ranks:

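| Rank | Vector Results | Lexical Results |
|------|----------------|-----------------|
| 1 | | Tee Shirt |
| 2 | | Golf Tee |
| 3 | Tee Shirt | Blouse |
| 4 | Jersey | Dress Shirt |
| 5 | Pants | Casual Shirt |
| 6 | Blouse | Deck Chair |
| 7 | Belt | Cotton Shirt |
| 8 | Cap | |
| 9 | Sticker | |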
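The penalty step amounts to adding a constant to each result's position. A minimal Python sketch, using the example result lists from this post (the helper name `adjusted_ranks` is my own and not something from the Mongo tutorial):

```python
vector_results = ["Tee Shirt", "Jersey", "Pants", "Blouse", "Belt", "Cap", "Sticker"]
lexical_results = ["Tee Shirt", "Golf Tee", "Blouse", "Dress Shirt",
                   "Casual Shirt", "Deck Chair", "Cotton Shirt"]

def adjusted_ranks(results, penalty=0):
    """Map each document to its 1-based position plus the penalty."""
    return {doc: rank + penalty for rank, doc in enumerate(results, start=1)}

vector_ranks = adjusted_ranks(vector_results, penalty=2)   # vector results pushed down by 2
lexical_ranks = adjusted_ranks(lexical_results)            # lexical results left as-is

for doc, rank in vector_ranks.items():
    print(f"vector  {rank}  {doc}")
for doc, rank in lexical_ranks.items():
    print(f"lexical {rank}  {doc}")
```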

OK, but what about the duplicates? Handling duplicates is one of the main reasons for over-fetching, since some spots are likely to overlap. How you handle the overlap is important. You could simply keep each document at its highest position, so if we pushed the results together now, we might see this in a top 5 search:

| Rank | Consolidated Results |
|------|----------------------|
| 1 | Tee Shirt |
| 2 | Golf Tee |
| 3 | Blouse |
| 4 | Jersey |
| 5 | Dress Shirt |
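The "keep the highest position" consolidation can be sketched in a few lines of Python. This is only an illustration of the idea with the example data from this post; the function name is hypothetical, and ties (such as Jersey and Dress Shirt, both at adjusted rank 4) are resolved here simply by which list was processed first.

```python
vector_results = ["Tee Shirt", "Jersey", "Pants", "Blouse", "Belt", "Cap", "Sticker"]
lexical_results = ["Tee Shirt", "Golf Tee", "Blouse", "Dress Shirt",
                   "Casual Shirt", "Deck Chair", "Cotton Shirt"]

def merge_by_best_rank(ranked_lists, limit):
    """Keep each document once, at the best (lowest) adjusted rank it achieved."""
    best = {}
    for results, penalty in ranked_lists:
        for rank, doc in enumerate(results, start=1):
            adjusted = rank + penalty
            best[doc] = min(adjusted, best.get(doc, adjusted))
    # sorted() is stable, so tied documents keep the order in which they were first seen.
    return sorted(best, key=best.get)[:limit]

top5 = merge_by_best_rank([(vector_results, 2), (lexical_results, 0)], limit=5)
print(top5)
```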

However, in the Mongo example, the results get a score related to their rank (1 divided by the Adjusted Rank), and then scores are added together. Here you can see how each rank is assigned a score with the Mongo adjustment.

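| Rank | Mongo Score | Vector Results | Lexical Results |
|------|-------------|----------------|-----------------|
| 1 | 1.00 | | Tee Shirt |
| 2 | 0.50 | | Golf Tee |
| 3 | 0.33 | Tee Shirt | Blouse |
| 4 | 0.25 | Jersey | Dress Shirt |
| 5 | 0.20 | Pants | Casual Shirt |
| 6 | 0.17 | Blouse | Deck Chair |
| 7 | 0.14 | Belt | Cotton Shirt |
| 8 | 0.13 | Cap | |
| 9 | 0.11 | Sticker | |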
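The score-and-sum step described above (an approach often referred to as reciprocal rank fusion) can be sketched in plain Python. This is not Mongo's aggregation pipeline, just the same arithmetic applied to the example lists from this post, with a vector penalty of 2 and no lexical penalty.

```python
from collections import defaultdict

vector_results = ["Tee Shirt", "Jersey", "Pants", "Blouse", "Belt", "Cap", "Sticker"]
lexical_results = ["Tee Shirt", "Golf Tee", "Blouse", "Dress Shirt",
                   "Casual Shirt", "Deck Chair", "Cotton Shirt"]
VECTOR_PENALTY = 2  # example value used in this post

scores = defaultdict(float)
for rank, doc in enumerate(vector_results, start=1):
    scores[doc] += 1.0 / (rank + VECTOR_PENALTY)   # 1 / adjusted rank
for rank, doc in enumerate(lexical_results, start=1):
    scores[doc] += 1.0 / rank                      # no penalty on lexical results

# A document found by both searches receives the sum of both contributions.
for doc, score in sorted(scores.items(), key=lambda item: -item[1])[:5]:
    print(f"{doc:12} {score:.2f}")
```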

With these weights, the top 5 could come out in either of the two orders shown below, depending on how the sort breaks ties, since "Blouse" totals ⅙ + ⅓ = ½ and "Golf Tee" also totals 0 + ½ = ½. You will also see that "Tee Shirt" has a highly inflated score, since it is now 1 + ⅓, ensuring it maintains the top spot.

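| Rank | Mongo Score | Option 1 | Option 2 |
|------|-------------|----------|----------|
| 1 | 1.33 | Tee Shirt | Tee Shirt |
| 2 | 0.50 | Golf Tee | Blouse |
| 3 | 0.50 | Blouse | Golf Tee |
| 4 | 0.25 | Jersey | Dress Shirt |
| 5 | 0.25 | Dress Shirt | Jersey |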

This brings us to the reason why you might choose to penalize both techniques: to prevent the first couple of positions from overpowering the others by such a large margin. By upping the vector penalty to 3 and the lexical penalty to 1, "Tee Shirt" drops to ¾, "Blouse" scores just under ⅖, and "Golf Tee" scores ⅓. This breaks the tie so that "Blouse" reliably sorts above "Golf Tee", while "Tee Shirt" still holds the top spot.

| Rank | Mongo Score | Consolidated Results |
|------|-------------|----------------------|
| 1 | 0.75 | Tee Shirt |
| 2 | 0.39 | Blouse |
| 3 | 0.33 | Golf Tee |
| 4 | 0.20 | Jersey |
| 5 | 0.20 | Dress Shirt |
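Putting the pieces together, here is a small Python sketch of the whole process with a penalty on each side. The function name `hybrid_rank` and its parameters are my own; the tutorial implements the equivalent logic in an aggregation pipeline. Ties are left in first-seen order (vector results first), which is an arbitrary choice for this example.

```python
from collections import defaultdict

def hybrid_rank(vector_results, lexical_results, vector_penalty, lexical_penalty, limit):
    """Sum 1 / (rank + penalty) per document across both lists and return the top `limit`."""
    scores = defaultdict(float)
    for results, penalty in ((vector_results, vector_penalty),
                             (lexical_results, lexical_penalty)):
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (rank + penalty)
    # Stable sort by descending score; tied documents keep first-seen order.
    ordered = sorted(scores.items(), key=lambda item: -item[1])
    return ordered[:limit]

vector_results = ["Tee Shirt", "Jersey", "Pants", "Blouse", "Belt", "Cap", "Sticker"]
lexical_results = ["Tee Shirt", "Golf Tee", "Blouse", "Dress Shirt",
                   "Casual Shirt", "Deck Chair", "Cotton Shirt"]

top5 = hybrid_rank(vector_results, lexical_results,
                   vector_penalty=3, lexical_penalty=1, limit=5)
for doc, score in top5:
    print(f"{doc:12} {score:.2f}")
```

Raising both penalties flattens the curve: the gap between first and second place shrinks, so a document that appears in both lists has a better chance of overtaking one that ranks highly in only one.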


Ideally, "Golf Tee" would score low enough to drop off the list, but because "tee" is an unusual word, it ranked particularly high due to the BM25 score. Similarly, a vector search might score an unwanted result particularly high because it is not tuned sufficiently for your data or the query is not precise enough. You can see "Jersey" and "Dress Shirt" also maintain identical scores. This represents a potential advantage of a higher result count on the initial query. What if, for example, you expanded the initial queries to return 10 items and "Dress Shirt" did show up at position 9 in the vector results? This would cause it to score higher than "Jersey". However, this may not be a desirable outcome, so querying twice as many results as you intend to display from vector and lexical searches may or may not produce a superior ranking.
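The arithmetic behind that "what if" is short enough to check directly. This sketch assumes the same penalties as the last example (vector 3, lexical 1); the helper `fused_score` is a hypothetical name that takes one (rank, penalty) pair per list in which the document appears.

```python
def fused_score(positions):
    """Sum 1 / (rank + penalty) over every result list that contains the document."""
    return sum(1.0 / (rank + penalty) for rank, penalty in positions)

VECTOR_PENALTY, LEXICAL_PENALTY = 3, 1

# Jersey appears only in the vector results, at position 2.
jersey = fused_score([(2, VECTOR_PENALTY)])

# Fetching 7 results: Dress Shirt appears only in the lexical results, at position 4.
dress_shirt_top7 = fused_score([(4, LEXICAL_PENALTY)])

# Fetching 10 results: suppose Dress Shirt also turns up at vector position 9.
dress_shirt_top10 = fused_score([(4, LEXICAL_PENALTY), (9, VECTOR_PENALTY)])

print(f"Jersey:                {jersey:.3f}")
print(f"Dress Shirt (7 each):  {dress_shirt_top7:.3f}")
print(f"Dress Shirt (10 each): {dress_shirt_top10:.3f}")
```

Even a ninth-place appearance adds enough to break the tie, which is why the size of the initial fetch quietly influences the final order.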


Having explored Mongo's approach to hybrid search, you should keep in mind that other approaches might work better for you. For instance, if your first page just returns 30 results to a user every time and your vector results aren't particularly well trained, you might choose to use slots 13 through 16 to display the top 4 vector results. Or, you may want a more even distribution, but with more weight placed on the vector results. For this, you could set it up to display a lexical result in every fourth slot. Regardless of which approach you apply, avoiding duplicates and choosing which search to favor will be a primary focus.
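As one possible reading of the "every fourth slot" idea, here is a Python sketch that fills most slots from a primary list and drops a result from the secondary list into every fourth position, skipping anything already shown. The function name, its parameters, and the fall-back behavior when one list runs dry are all my own assumptions.

```python
def interleave(primary, secondary, every=4, limit=10):
    """Fill slots from `primary`, using `secondary` for every `every`-th slot, without duplicates."""
    primary_iter, secondary_iter = iter(primary), iter(secondary)
    merged, seen = [], set()

    def next_unseen(preferred, fallback):
        # Take the next not-yet-shown document, falling back to the other list if needed.
        for source in (preferred, fallback):
            for doc in source:
                if doc not in seen:
                    return doc
        return None

    while len(merged) < limit:
        slot = len(merged) + 1
        if slot % every == 0:
            doc = next_unseen(secondary_iter, primary_iter)
        else:
            doc = next_unseen(primary_iter, secondary_iter)
        if doc is None:  # both lists exhausted
            break
        seen.add(doc)
        merged.append(doc)
    return merged

vector_results = ["Tee Shirt", "Jersey", "Pants", "Blouse", "Belt", "Cap", "Sticker"]
lexical_results = ["Tee Shirt", "Golf Tee", "Blouse", "Dress Shirt",
                   "Casual Shirt", "Deck Chair", "Cotton Shirt"]

# Vector results are favored; a lexical result lands in every fourth slot.
print(interleave(vector_results, lexical_results, every=4, limit=5))
```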


In some ways, Mongo's technique is a product of the tool they have available (Aggregation Pipelines). Creating a hybrid approach is, well, personal. It will depend on what's available to you and how reliable you feel your query sources are. Hybrid search is a field we take very seriously here at Innovent and we expect it to be a point of interest for the foreseeable future. In truth, Mongo's algorithm is a fairly straightforward general solution that may serve well in many use cases. However, we believe there is always a little more that can be done to improve result quality, and we'll be looking out for every opportunity.


If you're looking for help with this topic or anything related to search, please check out our Search Solutions and Contact Us. We'd love to hear from you.