Search is Broken

How many times have we heard that yelled in our general direction? Search engines, and their teams, are only as good as the data they are given. This is true for all search engines, for all time, regardless of any machine learning algorithms or other bells and whistles they may contain. Still, many of us are guilty of using our search engines to clean, normalize, associate, and redirect terms to bridge the gap between the vocabulary of our users and the labels used by our business. The most common reason: no time or resources to do the data engineering required before loading the data into the searchable data store. The reality is that we’ll spend the time and resources somewhere, so why address the symptoms when you can cure the disease?

Clean it Up

You might receive data feeds from vendors or other external sources that contain all sorts of variation. Some examples include proper names (Jim:James), brand names (, Inc.), colors (pink:fuchsia), sizes (L:large), units of measure (in.:inch:"), and even simple things like case and special characters. Since basic search is based on matching tokens, we need to put in the effort to normalize and expand those tokens as much as possible before making the data searchable. Why? Because it makes our jobs much more straightforward for the important part of search, which is making sure that our consumers can find what they are looking for. Having a handle on our business and product attribution gives us a huge advantage when attempting to match user queries with relevant results.
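Here is a minimal sketch of what that normalization could look like before the data is loaded. The lookup tables and the `normalize_value` helper are hypothetical and only cover the examples above; stripping accents is an extra assumption that may not suit every catalog.

```python
import re
import unicodedata

# Hypothetical lookup tables built from the examples in the text.
# A real feed would need much larger, curated mappings.
SIZE_MAP = {"s": "small", "m": "medium", "l": "large"}
UNIT_MAP = {"in": "inch", "in.": "inch", '"': "inch"}
NAME_MAP = {"jim": "james"}


def normalize_value(raw: str, lookup: dict[str, str]) -> str:
    """Clean up case, accents, and whitespace, then map known variants."""
    # Decompose accented characters and drop the combining marks (assumption).
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of whitespace, trim, and lowercase.
    text = re.sub(r"\s+", " ", text).strip().lower()
    # Replace a known variant with its canonical form; otherwise keep the text.
    return lookup.get(text, text)


print(normalize_value(" L ", SIZE_MAP))
print(normalize_value("In.", UNIT_MAP))
print(normalize_value("Jim", NAME_MAP))
```

Running this once at load time means the search engine only ever sees the canonical value, so no query-time rule has to guess that L and large are the same size.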

The Importance of Language

So, what do we mean by bridging the gap between the vocabulary of our users and the labels used by our business? It simply means this: all of the time we spend crafting taxonomies and product attribution means absolutely nothing if the consumers don’t use the same terms. For example, let’s say we’re a clothing company that uses hip, marketable terms for colors like ‘fuchsia’ and ‘lavender’. No amount of search engine magic is going to retrieve those values if the customer searches for pink or purple; the same is true for those searching for rosado or violeta. Language matters. Should we require our customers to select a little flag somewhere in the UI to select their language, or should we just respond to what they’re asking? Let’s ask the experts:

Mixed languages? Awesome! It turns out that many companies only translate part of their data into multiple languages. This is particularly true for key brand names and styles.
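One way to close that vocabulary gap is to expand the marketing label at load time, so the everyday words and their translations are stored with the product. This is only a sketch: the `COLOR_EXPANSIONS` table, the `color_terms` field, and both function names are made up for illustration, and the entries come straight from the example above.

```python
# Hypothetical mapping from marketing color labels to the words shoppers
# actually type, including the Spanish terms from the example.
COLOR_EXPANSIONS = {
    "fuchsia": ["pink", "rosado"],
    "lavender": ["purple", "violeta"],
}


def expand_color(label: str) -> list[str]:
    """Return the label plus any known synonyms and translations."""
    key = label.strip().lower()
    return [key] + COLOR_EXPANSIONS.get(key, [])


def build_search_document(product: dict) -> dict:
    """Copy the product and add a searchable list of color terms."""
    doc = dict(product)
    doc["color_terms"] = expand_color(product["color"])
    return doc


doc = build_search_document({"sku": "123", "color": "Fuchsia"})
print(doc["color_terms"])
```

The original `color` value is left untouched for display, while the expanded field gives a query for pink or rosado something to match.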

Hey! Doesn’t Machine Learning Solve That?

According to search engine sales people, yes, but here’s what they don’t tell you: machine learning models are built from your data! If you don’t have control over your data event flow and the resources and expertise to populate, tune and publish those models over time, in varying contexts, then your models can actually do more harm than good. Architectural decisions also play a role here. Here’s a picture from Alex Wang, on LinkedIn, that sums this up perfectly:

Controlling Data Event Flow

We all know that websites and mobile applications keep track of our interactions, but this is not automatic. Someone has to put the code hooks in place to do this. Those hooks typically lead to the capture and storage of meaningful events like:

  • search terms
  • filters (refinements, faceted navigation)
  • results lists
  • page views
  • add-to-cart
  • checkout
  • […]

The list is as long as the business requirements driving the data capture.
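As a rough illustration, a captured event could be as simple as the record below. The `SearchEvent` class and its field names are assumptions for this sketch, not a standard schema; your own requirements decide what goes in the payload.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class SearchEvent:
    """One captured interaction. All field names here are hypothetical."""

    session_id: str
    # For example "search", "filter", "page_view", "add_to_cart", "checkout".
    event_type: str
    # Event-specific details, such as the query text or the product ids.
    payload: dict = field(default_factory=dict)
    # Capture time in UTC, ISO 8601 formatted.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json(event: SearchEvent) -> str:
    """Serialize the event so it can be sent to whatever store collects it."""
    return json.dumps(asdict(event), sort_keys=True)


event = SearchEvent("session-1", "search", {"query": "purple shorts"})
print(to_json(event))
```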

Business Requirements?

Depending on consumer traffic on your sites, the event data we’re talking about can result in millions of records in a matter of minutes. Data capture should be strategic and iteratively provide value. For example, if we want to build a model that will eventually learn that a search for purple shorts should bring up those amazing fuchsia tennis shorts we’ve just spent millions of dollars advertising, then we’re going to need to know what our customers are searching for and whether they eventually found what they wanted and made a purchase.
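To make that concrete, here is a small sketch that pairs each session's most recent search with what was eventually bought. The event shape (`session_id`, `type`, `query`, `products`) and the function name are assumptions, and a real pipeline would do this over a data store rather than an in-memory list.

```python
from collections import Counter


def query_purchase_counts(events: list[dict]) -> Counter:
    """Count (search query, purchased product) pairs within each session.

    Events are assumed to arrive in time order.
    """
    last_query: dict[str, str] = {}
    counts: Counter = Counter()
    for event in events:
        session = event["session_id"]
        if event["type"] == "search":
            # Remember the most recent query for this session.
            last_query[session] = event["query"].strip().lower()
        elif event["type"] == "checkout" and session in last_query:
            # Credit the purchase to the last query seen in the session.
            for product in event["products"]:
                counts[(last_query[session], product)] += 1
    return counts


events = [
    {"session_id": "a", "type": "search", "query": "Purple Shorts"},
    {"session_id": "a", "type": "checkout", "products": ["fuchsia tennis shorts"]},
]
print(query_purchase_counts(events))
```

Pairs that show up often, like purple shorts leading to fuchsia tennis shorts, are the raw signal a model (or a plain synonym list) can be built from.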

Event Context

The event context is directly related to the consumer experience. That said, if you change the experience, can you keep using models built previously? Think of it this way: if you learn the steps and patterns to ride a bicycle, can you immediately ride a motorcycle? Some things translate, but you’re probably going to need a few more bits of information to be successful. The same is true for the data event flow; if you’re experimenting with the user experience (which is great!) just make sure you trace the variations in your event model.
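One simple way to trace those variations is to stamp every event with the experience it came from and keep the groups apart when building models. The `experience_version` field and the function below are hypothetical names for this sketch.

```python
from collections import defaultdict


def split_by_experience(events: list[dict]) -> dict[str, list[dict]]:
    """Group events by the experience variant that produced them."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        # Events captured before the field existed land in "unknown".
        groups[event.get("experience_version", "unknown")].append(event)
    return dict(groups)


events = [
    {"type": "search", "query": "pink top", "experience_version": "grid-v1"},
    {"type": "search", "query": "pink top", "experience_version": "list-v2"},
    {"type": "search", "query": "pink top"},
]
print(sorted(split_by_experience(events)))
```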

Business Silos

One of the things that makes event data capture a significant undertaking is that most organizations have separate teams responsible for each event. Sometimes even search and browse are not implemented in the same group, and they have no idea who the product detail page (PDP) or checkout teams are. The data has to work together to provide value, and so do the teams collecting it; figuring out a way to work across team silos is critical to success.

How to Build It

We know one thing for sure: while we build something new, we need to keep serving our customers. Has ‘lift and shift’ ever worked? After over 50 ecommerce implementations, adding or replacing search engines, I can honestly say not really. Shoe-horning new technology into an old interface because “we don’t want our customers to notice the difference” has always mystified me. The amount of technical debt introduced and the performance implications were always challenging. Is there a better way? Absolutely. Will it cost more up front? Yep. Will it cost less and provide more value in the future? Definitely. Here’s the best visual example of the solution I’ve ever seen:

This is California’s Bay Bridge. The new bridge was built next to the old bridge to keep the traffic flowing to and from San Francisco. As sections were completed, a portion of the traffic was diverted and analyzed; when successful, that section would open and the old section would close. When the project was completed, the old bridge was removed. Now, take a look at the engineering difference between the two bridges and imagine what would have happened if that same design, stability, and all of the other features had been superimposed onto the old bridge. Still, that is exactly what almost everyone attempts to do with their websites and mobile applications while changing the underlying technology. Just think about that for a while. Imagine being able to work on something brand new without the constant fear and stress of breaking the current implementation.
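In software terms, diverting a portion of the traffic can be sketched as a deterministic percentage rollout: each user is hashed into a bucket, and only buckets below the current percentage are sent to the new implementation. The function name and the choice of hashing a user id are assumptions for illustration, not a description of any particular product.

```python
import hashlib


def routes_to_new_stack(user_id: str, rollout_percent: int) -> bool:
    """Decide whether a user is served by the new implementation.

    The same user always lands in the same bucket, so raising the
    percentage only ever adds users to the new stack.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Turn the first four bytes of the hash into a bucket from 0 to 99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


# At 0 percent nobody is diverted; at 100 percent everybody is.
print(routes_to_new_stack("user-42", 0))
print(routes_to_new_stack("user-42", 100))
```

Because the old stack keeps serving everyone outside the rollout, results from the diverted traffic can be measured before the next section "opens".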


You can take short-cuts with many things, but data engineering is not one of them, especially when it comes to search implementations. Here are the basic steps to succeed:

  • Establish clear data cleansing, normalization, and expansion (e.g. translations, synonym generation) processes prior to loading searchable data
  • Establish a clear data event model feedback loop to enable analysis, experimentation, and model building capabilities
  • Build iteratively in parallel: deploy, test, measure, evaluate, decide (continue/change)

Remember: if they can’t find it, they can’t buy it.