Integrating RabbitMQ and MongoDB to Maintain Solr Indexes in Multiple Data Centers

Recently, I consulted for a large company that wanted to enhance their call center performance by improving their internal search experience. The existing system suffered from out-of-date data and slow query response times. The company was already using Oracle databases and Kafka queues across multiple data centers, and they were committed to utilizing Solr for this upgrade.

There were several challenges to address. For starters, this was a cross-data-center implementation, something Solr has historically struggled with; its built-in Cross Data Center Replication feature was removed entirely in version 9.0. Then there was the sheer volume of documents that needed to be indexed and searched, numbering in the hundreds of millions. While Solr can handle this volume, the document load process was taxing on the Oracle servers, which were also supporting other high-priority operations. We needed to ensure that updating Solr indexes wouldn't interfere with these critical processes. Lastly, the application deployment had to be managed via Docker and Kubernetes, which meant the applications needed well-defined purposes, easy deployment, and seamless communication without hard-coded addressing. So, to summarize, we needed an architecture that could:


  • Reduce demands on the Oracle servers
  • Transform the raw data into the document structure needed for search requirements
  • Apply incremental updates to Solr from Kafka queue messages
  • Unify data streams from disparate Kafka queues in multiple data centers
  • Share data across data centers so all Solr indexes were kept up to date

Proposed Solution

After several conversations, the client settled on an approach using a staging environment that could be populated once and then maintained with the incremental updates delivered by the Kafka queues. There would be two independent but overlapping processes. The full load process would populate the staging environment with multiple data Loader applications, transform the accumulated data into the desired Solr-friendly structure with a Transform application, and then trigger the Push applications in each data center to move the documents into the Solr indexes. The incremental process would use Kafka Listener applications to update the staging environment, trigger the Transform application to rebuild the Solr-friendly document for the affected ID, and finally trigger the Push applications in each data center to push the updated document.
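
To make the incremental path concrete, here is a minimal sketch of what a Kafka Listener application could look like, assuming Spring Kafka, Spring Data MongoDB, and Spring AMQP. The topic, queue, collection, and field names are illustrative, not the client's actual ones.

```java
// Hypothetical sketch of a Listener application in the incremental path:
// consume an update from Kafka, stage it in MongoDB, and notify the
// Transform application via RabbitMQ. All names are illustrative.
import org.bson.Document;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class CustomerUpdateListener {

    private final MongoTemplate mongo;
    private final RabbitTemplate rabbit;

    public CustomerUpdateListener(MongoTemplate mongo, RabbitTemplate rabbit) {
        this.mongo = mongo;
        this.rabbit = rabbit;
    }

    @KafkaListener(topics = "customer-updates")
    public void onMessage(String json) {
        Document change = Document.parse(json);
        String id = change.getString("id");

        // Upsert the raw change into the staging (input) collection.
        Update update = new Update();
        change.forEach(update::set);
        mongo.upsert(Query.query(Criteria.where("_id").is(id)), update, "customer_input");

        // Publish only the ID; the Transform app rebuilds the full search document.
        // Assumes a "transform-requests" queue has already been declared.
        rabbit.convertAndSend("transform-requests", id);
    }
}
```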


MongoDB

MongoDB proved to be an ideal staging environment for storing documents. Its primary utility lies in providing replication across data centers: the data Loaders could populate the primary MongoDB instance and it would handle the replication. Incremental updates would also be simpler, since any Kafka message could be written to the primary MongoDB instance and that, too, would be replicated to the MongoDB instances in the other data centers. Additionally, since MongoDB is a document store like Solr, the input collections could reflect the central database structure, while an output collection could hold documents structured like the search documents in Solr. By setting MongoDB's write concern to "majority", a write is acknowledged only after a majority of the replica set members have it, so replication across centers can generally be relied on at read time. Now we had an intermediate data store that took most of the load off the central database and acted as a data source for rapid concurrent loading of the Solr instances in each data center.
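
For illustration, here is a minimal sketch of how the Transform application might write a Solr-friendly document into the output collection with a "majority" write concern, using the MongoDB Java driver. The connection string, database, collection, and field names are assumptions.

```java
// Sketch: write a Solr-friendly document into the output collection,
// acknowledged only once a majority of replica set members have it.
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;

public class TransformWriter {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://mongo-primary:27017")) {
            MongoCollection<Document> output = client
                    .getDatabase("staging")
                    .getCollection("search_documents")
                    .withWriteConcern(WriteConcern.MAJORITY);  // replicate before acknowledging

            // Shape the document the way the Solr schema expects it.
            Document searchDoc = new Document("_id", "customer-12345")
                    .append("name", "Acme Corp")
                    .append("phone", "555-0100")
                    .append("last_updated", System.currentTimeMillis());

            // Upsert by ID so incremental updates overwrite the previous version.
            output.replaceOne(Filters.eq("_id", searchDoc.get("_id")),
                              searchDoc,
                              new ReplaceOptions().upsert(true));
        }
    }
}
```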


RabbitMQ

RabbitMQ solved the major hurdle of coordinating processes in the custom applications deployed across the data centers. It provides a couple of powerful tools for dealing with multiple data centers: federated queues and federated exchanges. A federated queue distributes each message so that it can be pulled one time, by an application instance in any data center. So, for instance, we could run multiple instances of a simple application that pulls messages off Kafka and drops them onto a federated queue; another application could then pull each message off the RabbitMQ queue, transform it, and store it in MongoDB. A federated exchange, conversely, is used when a message needs to be processed once in each data center. In both cases, the message can be published from any data center.
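
Federation itself is configured on the brokers (via the federation plugin, upstreams, and policies), so the applications publish and consume as usual. Below is a minimal sketch, using the RabbitMQ Java client, of how an application might publish to both kinds of destinations; the host, queue, and exchange names are illustrative.

```java
// Sketch: publish a document ID to a (federated) queue for one-time
// processing, and to a (federated) fanout exchange for per-data-center
// processing. Broker-side federation setup is assumed to exist already.
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class QueuePublisher {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq.local");

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // Federated queue: each message is consumed once, by whichever
            // Transform instance (in any data center) pulls it first.
            channel.queueDeclare("transform-requests", true, false, false, null);
            channel.basicPublish("", "transform-requests", null, "customer-12345".getBytes());

            // Federated fanout exchange: every data center's Push application
            // receives its own copy of the message.
            channel.exchangeDeclare("push-to-solr", "fanout", true);
            channel.basicPublish("push-to-solr", "", null, "customer-12345".getBytes());
        }
    }
}
```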


Final Design

The final design is shown in the following diagram:


[Diagram: applications in each data center connected through RabbitMQ, MongoDB, and Solr]

As you can see, MongoDB replicated the Solr-structured data across data centers and RabbitMQ facilitated communication between the applications. By using the queues, each application could simply tap into whatever data stream it needed without any one application having to know about the specific existence of any other. This allowed Docker and Kubernetes to start and stop applications at will without adverse effects.


The Result

This turned out to be an efficient solution that met all the requirements, and the client was pleasantly surprised by how fast incremental updates appeared in the Solr indexes. Their initial expectation was that data would be available for search within a 4-hour window; the actual response time for updates was measured in seconds, with a throughput rate high enough to easily keep up with thousands of updates per minute. Additionally, the full load process performed faster than expected because the queries were kept simple and Spring Batch made multithreaded ETL highly configurable. This meant a nightly full load, although not strictly needed, could be run to ensure data synchronization.
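
As an illustration of that last point, here is a minimal sketch of a multi-threaded Spring Batch step of the kind a full-load Loader application could use, written against the Spring Batch 5 builder API. The chunk size, thread pool, and reader/processor/writer beans are assumptions, not the client's actual configuration.

```java
// Sketch: a chunk-oriented Spring Batch step that reads from Oracle,
// reshapes rows into staging documents, and writes them to MongoDB,
// processing chunks concurrently via a task executor.
import org.bson.Document;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class FullLoadStepConfig {

    @Bean
    public Step fullLoadStep(JobRepository jobRepository,
                             PlatformTransactionManager transactionManager,
                             ItemReader<Document> oracleReader,
                             ItemProcessor<Document, Document> toStagingShape,
                             ItemWriter<Document> mongoWriter) {
        return new StepBuilder("fullLoadStep", jobRepository)
                .<Document, Document>chunk(500, transactionManager)
                .reader(oracleReader)          // simple paged query against Oracle
                .processor(toStagingShape)     // reshape rows into staging documents
                .writer(mongoWriter)           // bulk-write each chunk into MongoDB
                .taskExecutor(new SimpleAsyncTaskExecutor("loader-"))  // run chunks concurrently
                .build();
    }
}
```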


If you're looking for help with this topic or anything related to search, please check out our Search Solutions and Contact Us. We'd love to hear from you.