Finding the Right Tune with Solr and Machine Learning

Overview

Recently, a client engaged us to implement More Like This (MLT) search on their Solr platform to make it more efficient for their customers to locate music. The tracks customers find are used in movies, television shows, commercials, and similar productions. At a high level, customers want to find tracks that fit within, and enhance, their existing content.


In the current workflow, customers would find one track that interested them using search terms or attributes, then look for similar tracks to compare against or to round out their selection. This tells us something about the user’s process. First, if a user is searching for similar items, they have already found something they are interested in. They may be satisfied with what they found and simply want to confirm their choice (upgrading if a better option shows up), or they may like some aspects of the track but not others. Because of this, More Like This search is a balancing act: too similar and it’s unhelpful; too different and it doesn’t belong. This is especially important with music, since users can differ greatly on what music is “similar,” and listening to tracks takes time and energy. Building an MLT implementation that supports most users is difficult.


To tackle these challenges, it’s important to focus on what the user wants. If they like the track they found, they either want a slight improvement or confirmation that they made the right choice. If they like some parts of a track but not all, they are probably looking for a track that has the elements they like and none of the ones they don’t. Both cases benefit from diversity in the results. An MLT search for a Beatles track that pulls back only Beatles tracks may leave a user wondering what other artists they missed. Pulling back different versions of the same song (different recordings, covers, remixes, etc.) may or may not be relevant to the user. To get a diverse (but relevant) set of tracks back, we use clustering over a broader set of documents.


This is machine learning, and like all machine learning it’s not magic. It will not give the “correct” answer unless there is data to back it up and the algorithm can find it. If you only have track name and artist, no algorithm can group songs by mood, and a bag-of-words model cannot make decisions based on phrases.


The Solution

The first step in grouping related tracks is a notion of similarity. A Similarity Metric is a way to compare two tracks that yields a numerical score. Similarity metrics should always be domain specific (you wouldn’t compare songs the same way you compare foods or appliances) and should be carefully tuned by experts where possible. If the metric you have is symmetric (A->B gives the same result as B->A) and the scores “look good,” that should be all you need to start.
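

As a concrete (if simplified) illustration, here is a sketch in Python of what a symmetric, domain-specific metric might look like. The fields (tempo, genres, moods) and the weights are assumptions made for illustration, not the metric we actually used:

  # A minimal sketch of a domain-specific, symmetric similarity metric.
  # The fields and weights below are illustrative assumptions.

  def track_similarity(a: dict, b: dict) -> float:
      """Return a similarity score in [0, 1]; symmetric by construction."""
      # Tempo: closer BPM -> higher score (normalized by an assumed 200 BPM range).
      tempo_score = 1.0 - min(abs(a["tempo"] - b["tempo"]) / 200.0, 1.0)

      # Genres and moods: Jaccard overlap of tag sets.
      def jaccard(x: set, y: set) -> float:
          return len(x & y) / len(x | y) if (x or y) else 0.0

      genre_score = jaccard(set(a["genres"]), set(b["genres"]))
      mood_score = jaccard(set(a["moods"]), set(b["moods"]))

      # The weights would normally be tuned by domain experts.
      return 0.3 * tempo_score + 0.4 * genre_score + 0.3 * mood_score

Because every component compares the two tracks symmetrically, the score for A->B always matches B->A, which is the property the clustering below relies on.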


For music, the similarity metric we used carried no notion of position: knowing A->B and B->C gives us no concrete information about C->A. This limits the available options for clustering, since we can’t compute cluster centroids for k-means or calculate density for density-based clustering. We do have a fixed number of clusters (the number of tracks to return), and we also need a representative track for each, since the goal is to return one representative track per cluster.


To work within these constraints we decided on a Hierarchical Clustering approach, with the distance between clusters calculated as the distance between cluster prototypes. Since each cluster always has a prototype, the prototype can act as the representative track to return. Hierarchical Clustering with prototypes also ensures that tracks are only compared pair-wise, with no “position” required. It also contains no random elements (such as the seeds in k-means), so it will always cluster the same tracks the same way.


Brief Hierarchical Clustering algorithm outline:
  • To start, each individual track is considered its own “cluster” of size 1.
  • Repeat the following until the desired number of clusters is reached:
    • Find the two most similar clusters (by prototype distance) and combine them into a single cluster.
    • Recompute the cluster prototype as the member with the highest sum of similarities to the rest of the cluster.
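
Below is a minimal sketch of this outline in Python, assuming a precomputed symmetric similarity matrix sim (a list of lists, where higher scores mean more similar); the function and variable names are our own, for illustration only:

  # A sketch of the prototype-based hierarchical clustering outlined above.
  # `sim` is assumed to be a symmetric pairwise similarity matrix; tracks
  # are identified by their index into it.

  def prototype(cluster, sim):
      """The member with the highest total similarity to its cluster."""
      return max(cluster, key=lambda t: sum(sim[t][u] for u in cluster))

  def cluster_tracks(sim, num_clusters):
      # Each track starts as its own cluster of size 1.
      clusters = [[i] for i in range(len(sim))]
      protos = [c[0] for c in clusters]

      while len(clusters) > num_clusters:
          # Find the two clusters whose prototypes are most similar.
          i, j = max(
              ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
              key=lambda ab: sim[protos[ab[0]]][protos[ab[1]]],
          )
          merged = clusters[i] + clusters[j]
          # Remove the old clusters (j first so index i stays valid), add the merge.
          for k in sorted((i, j), reverse=True):
              del clusters[k]
              del protos[k]
          clusters.append(merged)
          protos.append(prototype(merged, sim))

      # Return one representative track per cluster.
      return protos

Note that there is no randomness anywhere in this loop: the same matrix and the same number of clusters always produce the same representatives, which is the determinism property mentioned above.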

This basic formula was a good start, but it required a few tweaks to work as expected. First, by setting a similarity cutoff on which documents are considered for clustering, we reduce the number of initial points (which improves performance) and ensure that every candidate document meets a minimum similarity standard. This should prevent entirely unrelated tracks from being suggested. However, just because tracks B and C are both highly similar to A doesn’t necessarily mean they’re highly similar to each other. Depending on your similarity metric, one track that is similar to A can be dissimilar to every other track that is highly similar to A. The algorithm would then never merge it into another group, guaranteeing that it is returned to the user. Since the goal is to return representative tracks, not unique tracks, two changes were needed. First, giving a baseline similarity score to pairs of tracks that are not otherwise similar allows dissimilar tracks to potentially be merged. Second, weighting similarity scores by cluster size encourages small clusters to be combined. While this will break “representativeness” for some clusters, it helps preserve it for others, and one cluster of “weird but similar” tracks might still be helpful to the user.
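

To make those two tweaks concrete, here is a sketch of how they could enter the merge-scoring step; the baseline value and the size-weighting function are illustrative assumptions, not tuned production values:

  # A sketch of the two adjustments described above, applied when scoring
  # a potential merge between two clusters.

  BASELINE_SIM = 0.05  # assumed floor so "loner" tracks can still be merged

  def merge_score(proto_a, proto_b, size_a, size_b, sim):
      # Floor the raw prototype similarity so no pair is entirely unmergeable.
      raw = max(sim[proto_a][proto_b], BASELINE_SIM)
      # Boost scores involving small clusters so they merge earlier; the
      # boost decays as the smaller cluster grows (an assumed scheme).
      boost = 1.0 + 1.0 / min(size_a, size_b)
      return raw * boost

Swapping this in for the plain prototype similarity in the merge loop is all the change required; the rest of the algorithm is untouched.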


One additional feature of this clustering is that it enables a “guided search” experience. A user finds a track they partially like and is presented with a set of candidate tracks. They can pick a few they like and perform the clustering again. By limiting the initial set of tracks to only those highly similar to all the tracks the user picks, the initial pool becomes more focused, and the responses from clustering will be tailored more closely to their interests. This is especially helpful with music, since users may not know what they like about a song, but can often recognize it in similar songs. By quickly isolating which aspects they want (and, to a limited extent, which they don’t), it’s possible to reach a highly relevant set of results without forcing the user to dig through hundreds of tracks or articulate exactly what they are looking for.
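

A sketch of how the candidate pool might be narrowed for guided search, assuming the same similarity matrix as above; the threshold value is an illustrative assumption:

  # Keep only candidates that are sufficiently similar to every track the
  # user picked; the threshold would be tuned against real data.

  SIM_THRESHOLD = 0.6  # assumed cutoff for "highly similar"

  def guided_candidates(picked, all_tracks, sim):
      """Restrict the clustering pool to tracks similar to all picked tracks."""
      return [
          t for t in all_tracks
          if t not in picked
          and all(sim[t][p] >= SIM_THRESHOLD for p in picked)
      ]

  # The narrowed pool is then clustered exactly as before (e.g., with
  # cluster_tracks from the earlier sketch) to produce the next set of
  # representatives shown to the user.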


Results

To illustrate how the clustering works in practice, I’ll use the similarity matrix for a small set of actual data, where higher numbers represent greater similarity.


To keep this small example manageable, we will assume that only 3 tracks can be displayed, since the number of potential tracks (kept small here for clarity) is usually much greater than the number shown. Using a conventional MLT approach, the returned tracks for Bright Futures (b) would be: Bright Futures (sting), Forever and a Day (a 60), and Forever and a Day (a). Clustering gives results: Bright Futures (sting), Forever and a Day (a), Ready Steady Go (a 30).


First, both of these approaches include the alternate track version. This is somewhat inevitable given its high similarity score to the seed track (it is an alternate version of it) and its low similarity to any of the other tracks. However, given additional tracks, the clustering will combine it with others, especially any additional alternate versions.


Second, showing both versions of Forever and a Day in the conventional MLT results is a waste. A user can easily find the alternate versions after deciding they like one, and given their similarity, the two are unlikely to be so different that the user will like only one of them. Showing any other track (even one only moderately relevant) gives the user a better chance of finding something they are interested in.


Third, these results may give the user a false sense of limitation. Seeing only one track other than the one they searched for (discounting alternate versions) may lead them to believe there aren’t many similar tracks and abandon their search. The tracks highlighted in yellow in the matrix are all highly related to each other, so showing any one of them will likely spark the user’s interest if any of them would have. The user could then either explore within the cluster (assuming the infrastructure is in place to allow this) or run a further MLT search on the new track.


Conclusion

Traditional More Like This search can provide results that are too narrow. Given the wide array of users’ interests in music (and many other fields), many users won’t have a path forward. By clustering results, we can support a much greater variety of use cases, guiding users quickly to the area they’re interested in instead of trapping them within the bubble they already found.


If this is an issue you’re facing and would like some help, we’d love to hear from you. Please feel free to contact us here.