A Tale of Java, LLMs, and Search

Overview

Innovent, being a technology company in the search space, is constantly looking for ways to improve our search products and services. AI and LLMs show great promise in improving the user experience in search, so we decided to put together a POC where simple web-based calls could be used to get results from common models such as SBERT, OpenAI, and Gemini. Generally speaking, Java is my primary programming language, although I’m also adept with JavaScript and have experience with several other languages. This is all to say, when learning how to develop with LLMs, my first impulse was to find a Java-based solution. That search was somewhat disappointing. While there are Java libraries to support LLM integration, they don’t lend themselves to “hello world” level programming. Try searching for “java llm” and you end up with a list of GitHub repositories whose names require specialized knowledge and whose documentation on how to leverage them is limited. So, research into other options was required. Here is a quick review of where that research led and how to create your own small web server that generates search results from an LLM.

What You Need from the LLM

Let’s take a moment to discuss what is needed from an LLM to support search. The first thing to know is that an LLM-based web server can be built without understanding how LLMs work. It doesn’t hurt to have that knowledge, but it’s like trying to understand battery chemistry to use your TV remote. All that is required from the LLM is a way to convert a term, sentence, document, or query into a vector. That’s it. That’s what an LLM provides. Of course, the LLM vectors will need to be rapidly correlated to one another and ranked by distance or relevance in order to be useful. This sort of algorithm is referred to as a k-nearest neighbors (KNN) search.
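
To make that concrete, here is a minimal illustration of what “convert text into a vector” means, using the same embedding class and model the service below relies on (the sample sentence is just an example):

	from langchain_huggingface import HuggingFaceEmbeddings

	# Wrap a pretrained SBERT-style model; the service below uses this same model.
	embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

	# Any term, sentence, or document becomes a fixed-length list of floats.
	vector = embeddings.embed_query("trail running shoes with good grip")
	print(len(vector))  # 384 dimensions for all-MiniLM-L6-v2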

If Not Java…

Obviously, interpreting a model and running KNN searches are just algorithms, and Java can certainly do it. However, it turns out the best libraries for it are written in Python, probably because most of the research into AI, ML, and LLMs has been done in Python. It’s vital to accept this, particularly when creating a simple POC; it will make life much easier. Creating a Python application that does the work and provides the vector results through a REST interface offers a simple bridge to Java applications. The entirety of the sample code explored in this document can be found here; it provides a /load endpoint that reads all the .json files in a directory into a vector database and a /query endpoint that searches it.

Python Service Components

Another thing you find while doing LLM research is a lot of references to Hugging Face. In fact, many pretrained models are available in Hugging Face format. One of the simpler methods for working with these models is to leverage the LangChain libraries. Additionally, a library will be needed for storing the vectors in a searchable database; for this, the Chroma library appears to be something of a standard. Finally, the POC application will need a framework to provide REST endpoints, such as the FastAPI library. Establishing the imports at the top of the file should look something like this (after adding a couple of other helper imports):

	from fastapi import FastAPI
	from pydantic import BaseModel
	from langchain_core.documents import Document
	from langchain_huggingface import HuggingFaceEmbeddings
	from langchain_chroma import Chroma
	import chromadb
	import os
	import json

A big part of learning Python is setting up the environment. The snippet above will throw errors until you add the fastapi, langchain, langchain-huggingface, and langchain-chroma packages to the Python environment. IDEs such as IntelliJ IDEA provide utilities for doing this, but package management tools such as pip can also be used directly. Other installation tools such as Homebrew or Conda also work, but pip is generally recommended for Python packages.
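
Assuming a standard pip-based setup, the install looks something like this (uvicorn is added here to run the FastAPI app, and sentence-transformers is listed explicitly in case it is not pulled in as a dependency):

	pip install fastapi uvicorn langchain langchain-huggingface langchain-chroma sentence-transformers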

About Chroma

Chroma is a vector database that can be bound to pretrained LLMs and provides methods for running the KNN search mentioned above. As such, it is the model layer of a simple Python LLM service. The fundamental process for Chroma is configure it, load it, query it.

Configuring Chroma

This code snippet is all the configuration needed. Since this is a small Python script, the configuration is not wrapped up in a function; it resides at module level, just below the imports listed above.

	embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
	persistent_client = chromadb.PersistentClient()
	collection = persistent_client.get_or_create_collection(
	    name="docs",
	    metadata={"hnsw:space": "cosine"})
	db = Chroma(
	    client=persistent_client,
	    collection_name="docs",
	    embedding_function=embeddings
	)

The metadata sets up Chroma to use cosine similarity instead of Euclidean distance. Cosine similarity generally works better for text relationships because it compares the direction of the vectors rather than their magnitude. The embedding function is provided by the HuggingFaceEmbeddings class wrapped around the all-MiniLM-L6-v2 pretrained model. That model is based on SBERT, which makes it good for sentence and text similarity searches.

Loading Chroma

As with any data store, Chroma must be populated with records that can be returned. Chroma permits storing raw text samples, but documents tend to be a more useful format since they can include metadata such as IDs, which can be returned in the query response. That metadata can be used to cross-reference the results with a relational database or a lexical search index (such as OpenSearch). This small snippet is part of the /load service endpoint in the sample code (a fuller sketch of the endpoint follows it). It builds a document array and submits it to Chroma through the db.add_documents call:

    for i in json_root:
        doc_list.append(Document(
            page_content=i[text_field],
            metadata={"id": str(i[id_field])},
            id=str(i[id_field])
        ))
        id_list.append(str(i[id_field]))
    db.add_documents(documents=doc_list, ids=id_list)
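
For context, here is a minimal sketch of what the full /load endpoint might look like. The request model, field names, and defaults are illustrative assumptions rather than the sample code verbatim; it builds on the imports and the db object defined above:

	app = FastAPI()

	# Hypothetical request body: where the .json files live and which fields to use.
	class LoadRequest(BaseModel):
	    directory: str
	    text_field: str = "text"
	    id_field: str = "id"

	@app.post("/load")
	def load(req: LoadRequest):
	    doc_list = []
	    id_list = []
	    # Read every .json file in the directory; each file holds an array of records.
	    for filename in os.listdir(req.directory):
	        if not filename.endswith(".json"):
	            continue
	        with open(os.path.join(req.directory, filename)) as f:
	            json_root = json.load(f)
	        for i in json_root:
	            doc_list.append(Document(
	                page_content=i[req.text_field],
	                metadata={"id": str(i[req.id_field])},
	                id=str(i[req.id_field])
	            ))
	            id_list.append(str(i[req.id_field]))
	    # Embed and store everything in the Chroma collection configured earlier.
	    db.add_documents(documents=doc_list, ids=id_list)
	    return {"documents_loaded": len(doc_list)}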
 

Querying Chroma

Chroma provides the similarity_search_with_score() function to query the documents that have been loaded. In the sample code it is used in the /query endpoint and looks like this:

	results = db.similarity_search_with_score(q, count)

In this line, q is the query string and count is the number of documents to return. Because it returns a set of ids and scores, this query creates a result set that is ready to be merged with results from another search index as part of a Hybrid Search operation.
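
As with /load, here is a minimal sketch of what the /query endpoint might look like; the parameter names and response shape are illustrative assumptions:

	@app.get("/query")
	def query(q: str, count: int = 10):
	    # Embed the query and run the KNN search against the stored documents.
	    results = db.similarity_search_with_score(q, count)
	    # Each result is a (Document, score) pair; with the cosine space configured
	    # above, the score is a distance, so lower values mean a closer match.
	    return [{"id": doc.metadata["id"], "score": score} for doc, score in results]

With the service running under an ASGI server such as uvicorn, a Java application (or anything else that speaks HTTP) can call GET /query?q=...&count=5 and merge the returned ids and scores with its own results.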

Wrapping It Up

To be honest, I’m not a fan of Python. Java and C++ are much more comprehensive programming languages with better support for most operations. However, Python is clearly easier for working with pretrained LLM models right now. Until Java catches up in ease of use, running a Python service is the best option for a POC, and it can be done in a scant 70 lines of code. Good luck in your efforts at leveraging LLMs!