Introduction
Search is often presented as a conceptually straightforward, if difficult, problem: finding the most relevant items for a user's query. The user has some need, and the goal is to find the best items to fill that need. However, not all search problems are quite so direct.
Set design, the task of putting together environments for film or other visual media, operates a little differently. The goal is to create a cohesive set of items that fulfill multiple roles and visually communicate information about the setting. A scene might call for a couch, multiple chairs, an end table, and several additional decorative elements. These items need to match the overall setting for the scene, and together they should convey a message about the space and the people who created it. This means that it’s not just about finding the best chair, but the chair that matches the style of the couch and the table. As the scene direction changes, changing one element might make another no longer suitable. To reflect this, a search needs to take into account not just the type of the requested item (say, a lamp) but also the overall scene and the items that have already been selected for it.
This project explored several different approaches to scene-oriented retrieval using OpenSearch, vector embeddings, and multimodal search techniques. The goal was not simply to retrieve individually relevant items, but to explore how modifications to traditional search workflows could be used to iteratively build a cohesive scene more effectively than repeatedly searching for isolated items.
First Approach: Biasing Retrieval Using Selected Items
The first and most basic approach was to allow a user to select items for the “scene” and then influence future searches based on those selected items. This was accomplished by using the embedding vectors from those items to modify the query vector prior to searching. A weighted combination of the query vector and the item vectors produces a search that sits partway between a normal vector search and a “more like this” item-based search on the selected items. For the proof of concept, a manual slider was used to adjust the balance between query and selected items.
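As a rough illustration of the weighted combination described above, the sketch below blends a query embedding with the centroid of the selected items' embeddings. The function names, the normalization steps, and the `item_weight` parameter (standing in for the slider) are assumptions made for illustration, not the project's actual code.

```python
import math


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]


def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def blend_query(query_vec, selected_vecs, item_weight):
    """Blend the query with the centroid of the selected items.

    item_weight is the hypothetical slider value: 0.0 gives a plain vector
    search, 1.0 gives a pure "more like this" search on the selected items.
    """
    if not 0.0 <= item_weight <= 1.0:
        raise ValueError("item_weight must be between 0 and 1")
    q = normalize(query_vec)
    if not selected_vecs or item_weight == 0.0:
        return q
    centroid = normalize(mean_vector([normalize(v) for v in selected_vecs]))
    blended = [(1 - item_weight) * a + item_weight * b for a, b in zip(q, centroid)]
    # Re-normalize so the result can be used directly with cosine similarity.
    return normalize(blended)
```

The blended vector would then be sent as the query vector of an ordinary k-NN search. Normalizing each input first keeps a long vector from dominating the mix, which is a design choice of this sketch rather than something the project is known to have done.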
Results
This approach successfully biased the results based on selected items, but had a significant drawback. Too many similar items (say, several pieces of furniture) could partially or completely override the search itself. As an example, if the user is setting up a “Victorian” scene and has already selected a chair and a couch, a search for a lamp will be adjusted not just in a “Victorian” direction, but also in a “furniture” direction, and may show additional couches and chairs instead of lamps. Reducing the influence of selected items mitigated this, but also removed much of the benefit.
Second Approach: Introducing Positive and Negative Examples
The second iteration introduced the ability to select “negative” items: examples of bad results that the search should be steered away from. The hope was that adding negative items would mitigate the bias towards the type of the items already selected (like chair) and push instead towards their “style” (like “college dorm room”).
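One common way to steer a query vector with both kinds of examples is a Rocchio-style update: add a fraction of the positive centroid and subtract a fraction of the negative centroid. The sketch below assumes that formulation; the function name and the `pos_weight` and `neg_weight` parameters are hypothetical, and the project's exact formula may differ.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steer_query(query_vec, positives, negatives, pos_weight=0.5, neg_weight=0.5):
    """Move the query towards positive examples and away from negative ones.

    Assumed Rocchio-style formula:
        q' = q + pos_weight * mean(positives) - neg_weight * mean(negatives)
    An empty example list contributes nothing.
    """
    zero = [0.0] * len(query_vec)
    pos = mean_vector(positives) if positives else zero
    neg = mean_vector(negatives) if negatives else zero
    return [q + pos_weight * p - neg_weight * n
            for q, p, n in zip(query_vec, pos, neg)]
```

With cosine similarity, only the direction of the result matters, so the steered vector can be used as a k-NN query vector without rescaling as long as it is not the zero vector.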
Results
This second approach yielded a surprising benefit: the use of negative examples made it easier to find good candidate items even when no other items were selected for the scene. For a generic search like “chair”, there are too many possible options to guarantee that an item the user is interested in will be in the results. Normally the next step would be to add descriptors to the search (“Victorian chair” or “office chair”) or to apply filters. Negative selection offers an additional path: move away from undesirable results while still exploring the whole space. For a user who has a style in mind but may not know the industry language for it (or when that language doesn’t appear in the data), it’s easier to select undesirable examples and see a newly tailored list. For example, a user might mark folding chairs and wheeled office chairs as negative to focus on cushy upholstered chairs.
The combination of positive and negative examples proved to be very powerful for single searches and small numbers of selected items, providing meaningful and useful stylistic influence on the query. For larger numbers of selected items, however, it proved unwieldy. It’s reasonable to expect users to choose different numbers of negative examples when searching for different items. Because all examples were pooled into one positive list and one negative list, the balance between positive and negative items was lost for large sets, and many of the problems from the first approach reappeared. As an example, if a user selected 1 positive and 3 negative items for chairs and 1 positive and 1 negative item for lamps, the averages would give a net negative towards chairs and a net positive towards lamps, skewing search results for the next item.
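The arithmetic behind that example can be checked with a toy two-dimensional space in which the first axis stands for "chair-ness" and the second for "lamp-ness". Real embeddings are not axis-aligned like this, and the helper names are hypothetical; the sketch only shows why pooled averages drift when the counts are unbalanced.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def net_direction(positives, negatives):
    """Mean of the positives minus mean of the negatives."""
    pos = mean_vector(positives)
    neg = mean_vector(negatives)
    return [p - n for p, n in zip(pos, neg)]


CHAIR = [1.0, 0.0]  # toy "chair" embedding
LAMP = [0.0, 1.0]   # toy "lamp" embedding

# Pooled lists: 2 positives (chair, lamp) and 4 negatives (3 chairs, 1 lamp).
pooled = net_direction([CHAIR, LAMP], [CHAIR, CHAIR, CHAIR, LAMP])
print(pooled)  # [-0.25, 0.25]: away from chairs, towards lamps

# Computed per item type, the type component cancels out.
chairs_only = net_direction([CHAIR], [CHAIR, CHAIR, CHAIR])
lamps_only = net_direction([LAMP], [LAMP])
print(chairs_only, lamps_only)  # [0.0, 0.0] [0.0, 0.0]
```

In the pooled case the positive mean is (0.5, 0.5) and the negative mean is (0.75, 0.25), so the difference leans towards lamps even though the user expressed no such preference.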
Third Approach: Moving from Search to Scene Construction
To improve on the previous approach, the user interface was changed from a simple search box with global positive and negative item lists to a fuller scene-building tool with an overall theme and a separate search for each desired item.
This allowed separate tuning of how much each factor affected the query: the item-specific search term, that item's own positive and negative examples, the overall theme, and the positive and negative examples from other items.
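One way to picture this per-item weighting is sketched below. Each item in the scene gets its own "slot" with its own examples, and the final query vector is a weighted sum of four factors. The `Slot` and `Weights` classes, the default weight values, and the choice to average feedback across other slots are all assumptions for illustration; the source does not specify how the factors were combined.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """One item the scene needs (e.g. a lamp), with its own feedback."""
    term_vec: list
    positives: list = field(default_factory=list)
    negatives: list = field(default_factory=list)


@dataclass
class Weights:
    """Hypothetical tuning knobs, one per factor; values are placeholders."""
    term: float = 1.0
    own_examples: float = 0.5
    theme: float = 0.3
    other_examples: float = 0.2


def mean_or_zero(vectors, dim):
    """Component-wise mean, or a zero vector when the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def feedback(slot, dim):
    """Positive mean minus negative mean, computed within a single slot."""
    pos = mean_or_zero(slot.positives, dim)
    neg = mean_or_zero(slot.negatives, dim)
    return [p - n for p, n in zip(pos, neg)]


def slot_query(slot, other_slots, theme_vec, w):
    """Combine the four factors into one query vector for this slot."""
    dim = len(slot.term_vec)
    own = feedback(slot, dim)
    # Each other slot is balanced on its own before being averaged in.
    others = mean_or_zero([feedback(s, dim) for s in other_slots], dim)
    return [w.term * t + w.own_examples * o + w.theme * th + w.other_examples * x
            for t, o, th, x in zip(slot.term_vec, own, theme_vec, others)]
```

Computing the feedback inside each slot before mixing keeps one item's large pile of negatives from outweighing another item's examples.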
Results
This approach, combined with some tuning, fixed the issue with similar items overriding the search term. The influence of the overall theme combined with selected products provided a meaningful and useful tailoring of search results towards the intended style.
Areas for Further Exploration
There are several avenues still to pursue to get a complete picture of how a system like this would work best in production. Evaluation by domain experts would allow for testing more subtle patterns and similarities between items. Fine-tuning models on the data set itself might capture those patterns with higher fidelity and improve performance over the general-purpose models used for the demo. User experience testing would also help ensure that the interactions for selecting and managing items feel natural.
Summary
Selecting items for a film scene presents challenges that aren’t fully addressed by typical search systems. The effectiveness of the scene requires all of the selected items to work together, not just to be relevant individually. A more interactive, context-aware search workflow, in which users select examples of both desirable and undesirable items, greatly improved the experience of putting together a complete scene. Even when a user isn’t putting together a group of items, negative selection can still help narrow results by a sense of “taste” that may have no good textual description for conventional search (or, if it does, the user may not know the right terms). While there’s more work to be done around user interaction and custom-tuned models, this project showed a novel approach that improved the user experience for this class of group-based searching and showed significant promise for certain “style”-based searches as well.
Retrieval Architecture & Implementation Details
This section contains more of the technical details of the project setup and is intended primarily for the curious and technically minded.
The data for this project contains ~82,000 items of various types, including chairs, couches, tables, lamps, books, knickknacks, and other items that might show up in a movie scene. Each item has a description that typically states what type of item it is along with its style, time period, materials, and other associated information. An example would be:
“GLOBE SHADE: Vintage Pressed Amber Satin Glass, Tapered Cylinder Shape, Reeded Sides Look Like Book Spines, 3-1/8'' Fitter, 6''W x 6''D x 5-1/2''H”
Most of the documents have images, but some are missing them. For the purposes of image-based search, the items with missing images were ignored.
Data was processed and loaded into OpenSearch. Several different ways of searching were set up and reviewed; details are included below. For this project, hybrid search combining OpenAI vector embeddings of the descriptions and OpenCLIP embeddings of the images was found to be the most useful and was the primary search method for the results. Lexical search wasn’t included in the hybrid setup used for those results, since a lexical query has no vector that can be averaged to bias results towards items already selected.
- Lexical
  - Several fields, including description and category, were indexed for text-based BM25 search.
  - Only minimal tuning was performed on this search: basic field weighting and text analysis. No synonyms or domain-specific processing were applied.
- Basic Vector
  - The long description was embedded using the all-MiniLM-L6-v2 model from HuggingFace.
  - Queries were performed using OpenSearch HNSW search with the lucene engine and cosine similarity.
- OpenAI
  - The long description was embedded using the text-embedding-3-large model from OpenAI.
  - Queries were performed using OpenSearch HNSW search with the lucene engine and cosine similarity.
- Image
  - Item images (when available) were embedded using OpenCLIP and the hf-hub:laion/CLIP-ViT-g-14-laion2B-s12B-b42K model from HuggingFace.
  - Queries were performed using OpenSearch HNSW search with the lucene engine and cosine similarity.
- Hybrid
  - Hybrid queries combined the above Lexical, OpenAI, and Image searches using the OpenSearch hybrid search pipeline.
  - A normalization processor with the “min_max” normalization technique and the “arithmetic_mean” combination technique was used for the combination.
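For readers who want to reproduce a similar setup, the sketch below builds the three OpenSearch request bodies such a configuration typically needs: an index mapping with HNSW vector fields, a search pipeline with a normalization processor, and a hybrid query over two k-NN sub-queries. The field names (`description_vector`, `image_vector`), the helper function names, and the default `k` are hypothetical. The bodies follow the documented OpenSearch k-NN and hybrid search formats, but they are not the project's actual configuration.

```python
import json


def knn_field(dimension):
    """Mapping for one HNSW vector field (lucene engine, cosine similarity)."""
    return {
        "type": "knn_vector",
        "dimension": dimension,
        "method": {"name": "hnsw", "engine": "lucene", "space_type": "cosinesimil"},
    }


def index_body(text_dim, image_dim):
    """Body for creating the index; the field names are assumptions."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "description": {"type": "text"},
                "description_vector": knn_field(text_dim),
                "image_vector": knn_field(image_dim),
            }
        },
    }


def pipeline_body():
    """Body for PUT /_search/pipeline/<name>: min_max plus arithmetic_mean."""
    return {
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {"technique": "arithmetic_mean"},
                }
            }
        ]
    }


def hybrid_query(text_vec, image_vec, k=20):
    """Hybrid query over the description and image vector fields.

    image_vec is assumed to be a query embedding in the same space as the
    image embeddings (for example, from the CLIP text encoder).
    """
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    {"knn": {"description_vector": {"vector": text_vec, "k": k}}},
                    {"knn": {"image_vector": {"vector": image_vec, "k": k}}},
                ]
            }
        },
    }


# Tiny placeholder vectors, just to show the shape of the request.
print(json.dumps(hybrid_query([0.1, 0.2], [0.3, 0.4], k=5), indent=2))
```

The query body would be sent to the index's `_search` endpoint with the `search_pipeline` parameter naming the pipeline created from `pipeline_body()`. The dimensions passed to `index_body` must match the embedding models; text-embedding-3-large returns 3072-dimensional vectors by default, while the image dimension depends on the CLIP checkpoint.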