Overview
Thinking about Spring Batch as your framework for loading a Solr index? With some minor customizations, it won’t take much effort. Picking the correct library and understanding some Solr fundamentals will simplify the endeavor. If you’re new to Spring Batch, the code snippets provided here are intended to work with the examples in the Spring Getting Started guide. That guide walks you through creating a simple Batch project and covers basic Batch configuration, which will not be repeated here. Refer to it for questions about configuring your data source, or to the more advanced documentation if you need to leverage Batch’s multithreading capabilities. The goal of this article is to examine how to integrate Solr as the ETL destination.
The Details
The first thing to do is add the Solr library to your project. If you do a quick search for “spring batch solr” you may find references to a Spring Data Solr project on spring.io — do not use this! Spring Data Solr was discontinued in 2020 and using it may cause issues. Instead, use the SolrJ library that is the same version as your installed Solr instance.
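If you are using Maven, the dependency will look something like this (set the version to match your installed Solr instance; the version property shown is just a placeholder):

<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-solrj</artifactId>
    <!-- keep this in sync with the version of your Solr deployment -->
    <version>${solr.version}</version>
</dependency>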
The next thing to know is that SolrJ clients are thread-safe. This means defining your client as a Spring bean will not pose any re-entrancy issues or block calls to Solr. Assuming the data source’s connection pool is large enough, the Batch process will effectively open as many concurrent connections to Solr as the core pool size of the ThreadPoolTaskExecutor assigned to it allows.
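If you do decide to run the step on multiple threads, a minimal sketch of such an executor is shown below; the bean name and pool sizes are placeholders, and it would be attached to the step via the StepBuilder’s taskExecutor(...) method (see the Spring Batch documentation for the details):

@Bean
public ThreadPoolTaskExecutor batchTaskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    // Each concurrent chunk results in its own call to Solr, so the core pool
    // size effectively caps the number of in-flight Solr requests.
    executor.setCorePoolSize(4);
    executor.setMaxPoolSize(4);
    return executor;
}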
Since the SolrJ library is being used, a Solr client should be made available as a simple bean in a configuration class, like so:
@Configuration
public class SolrConfiguration {

    @Bean("solrClient")
    public CloudSolrClient solrClient(@Value("${solr.zk-servers}") String[] zkServers,
                                      @Value("${solr.connection-timeout}") int connectionTimeout,
                                      @Value("${solr.client-timeout}") int socketTimeout,
                                      @Value("${solr.target-collection}") String collection) {
        return new CloudSolrClient.Builder(List.of(zkServers), Optional.empty())
                .withZkConnectTimeout(connectionTimeout, TimeUnit.MILLISECONDS)
                .withZkClientTimeout(socketTimeout, TimeUnit.MILLISECONDS)
                .withDefaultCollection(collection)
                .build();
    }
}
This configuration uses the SolrCloud client, which works with the SolrCloud deployment that is more common with current versions. The properties are passed in from the application.properties file directly to simplify configuration, with the most important being the list of ZooKeeper servers. In a normal deployment on a developer’s machine this will be localhost:2181. The collection must be a pre-existing collection in a deployed Solr instance. You could also use a standard HttpSolrClient to connect to a Solr deployment without going through ZooKeeper.
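For reference, the corresponding entries in application.properties might look like the following; the timeout values and collection name are placeholders, and solr.chunk-size is used by the step definition later in the article:

# Solr connection settings (sample values for a local SolrCloud instance)
solr.zk-servers=localhost:2181
solr.connection-timeout=15000
solr.client-timeout=15000
solr.target-collection=products
solr.chunk-size=100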
With the SolrJ library, you will need to create a custom ItemWriter implementation. ItemWriter is a simple interface the Batch process calls when it is ready to push a chunk of data into Solr. Here is an example of a SolrItemWriter that makes use of the solrClient bean above:
@Slf4j
@Component("solrWriter")
public class SolrItemWriter implements ItemWriter<SimpleProduct> {

    // Ask Solr to commit within one second (value is in milliseconds).
    private static final int COMMIT_WITHIN = 1000;

    private final SolrClient solrClient;

    @Autowired
    public SolrItemWriter(SolrClient solrClient) {
        this.solrClient = solrClient;
    }

    @Override
    public void write(Chunk<? extends SimpleProduct> chunk) throws Exception {
        Collection<SolrInputDocument> docs = new ArrayList<>();
        for (SimpleProduct product : chunk) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.setField("id", product.getId());
            // The _s and _d suffixes map to Solr's default dynamic field types.
            doc.setField("name_s", product.getName());
            doc.setField("description_s", product.getDescription());
            doc.setField("price_d", product.getPrice());
            docs.add(doc);
        }
        UpdateResponse resp = solrClient.add(docs, COMMIT_WITHIN);
        if (resp.getStatus() != 0) {
            throw new Exception("Solr was unable to update a chunk of documents.");
        }
    }
}
This method iterates through the chunk of SimpleProduct objects produced by the Batch process’ ItemReader (probably a JdbcCursorItemReader) and populates a Solr collection via the SolrClient. Updates must be committed; this can be done by passing a commitWithin parameter to the add command (as shown above), adding a commit step to the Batch process, or through Solr configuration. Commits are necessary to make documents available for search.
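If you would rather issue an explicit commit from the Batch process instead of relying on commitWithin, a minimal sketch of a tasklet step that does so might look like this (the bean and step names are assumptions):

@Bean
public Step commitStep(JobRepository jobRepository,
                       DataSourceTransactionManager transactionManager,
                       SolrClient solrClient) {
    // Issue a hard commit after the import so every added document becomes searchable.
    Tasklet commitTasklet = (contribution, chunkContext) -> {
        solrClient.commit();
        return RepeatStatus.FINISHED;
    };
    return new StepBuilder("commitStep", jobRepository)
            .tasklet(commitTasklet, transactionManager)
            .build();
}

Such a step would run after the import step in the Job definition.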
Finally, the Batch configuration needs to wire the SolrItemWriter into a Step that will be included in the Job. The import step should look something like this:
@Bean
public Step etlStep(JobRepository jobRepository,
                    DataSourceTransactionManager transactionManager,
                    JdbcCursorItemReader<SimpleProduct> jdbcCursorItemReader,
                    SolrItemWriter solrWriter,
                    @Value("${solr.chunk-size}") int chunkSize) {
    return new StepBuilder("etlStep", jobRepository)
            .<SimpleProduct, SimpleProduct>chunk(chunkSize, transactionManager)
            .reader(jdbcCursorItemReader)
            .writer(solrWriter)
            .build();
}
You will need to make this step part of the Job definition, as sketched below. Please reference the Getting Started documentation for a more complete understanding of how to configure the Job.
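As a minimal sketch, assuming the etlStep bean defined above, the wiring might look like this (the job name is a placeholder):

@Bean
public Job importProductsJob(JobRepository jobRepository, Step etlStep) {
    return new JobBuilder("importProductsJob", jobRepository)
            .start(etlStep)
            // .next(commitStep)  // only if you added an explicit commit step
            .build();
}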
Conclusion
Configuring a Spring Batch job to import data into Solr is a convenient approach with a quick development cycle. Remember to use SolrJ and make sure your updates are committed, and you should be able to search your index in short order. While you’re here, you may want to check out some of our other articles or our Solr consulting services to get the most out of Solr! Good luck!