Chapter 17. Solr Search

Drupal’s default search backend uses MySQL to implement some fairly advanced search capabilities. While this is fine for small sites and kind of impressive in its own right, it can prove quite a performance bottleneck for larger sites that contain many nodes and where the database may already be under moderate to heavy load. Luckily, Drupal’s core search system can be enhanced or completely replaced by contrib modules. Not only does this provide a way to offload search queries from MySQL, but it can also bring additional features that aren’t part of the traditional Drupal search. For example, searching with Solr provides faceted search functionality—a way to filter search results based on categories or groupings—and spellcheck, two widely used search features.

There are a number of popular open source search technologies, such as Elasticsearch, Solr, Sphinx, and Xapian. While all of these have Drupal modules, Solr is by far the most actively developed and widely used module, so we will focus exclusively on integrating Solr throughout this chapter. That is not to say that other search technologies aren’t as good as Solr; they simply haven’t been as well integrated into Drupal as Solr.

Performance and Scalability Considerations

On smaller Drupal sites, the search queries done by the default search module may not be particularly “heavy” SQL queries, but they do contribute to the overall load on the database server. On sites with a large enough data set, the queries can be downright performance killers. For this reason, Drupal’s built-in search should not be used on anything but the smallest of sites; it does not scale well enough for large sites. Search is one place where we can not only offload a task from the database server, but improve performance and scalability at the same time.

As discussed in Chapter 7 and the chapters on MySQL, MySQL is not something that can easily be scaled horizontally simply by adding more servers. On the other hand, pretty much every search-specific application has been designed to be able to scale horizontally. While most sites won’t have enough search traffic to necessitate a large search cluster, there certainly are sites that benefit from increased performance and easily scalable search.

Integrating Solr with Drupal

There are at least two modules that tie into Solr for search: apachesolr and search_api. They provide a similar set of features, either directly or by tying into other modules. Some of the most commonly used Solr features are:

Spellcheck
Solr can provide search results for words spelled similarly to those in the initial query.
More like this
This feature allows you to find similar content to a particular result; this can be useful for displaying a block on a page to link to other similar pieces of content.
Faceted search
This provides users with a way to filter search results—for example, limiting results to content updated within a certain time frame, or only by a particular author.
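
To make these features more concrete, here is a minimal sketch of how a faceted, spellchecked query could be assembled as a Solr select URL. The parameter names (q, fq, facet, facet.field, spellcheck, wt) are standard Solr request parameters, but the core URL and the field names bundle and is_uid are assumptions for illustration; check them against the schema.xml on your server. Spellcheck also only works if the request handler in solrconfig.xml has it configured. In practice the Drupal modules build these requests for you.

```python
from urllib.parse import urlencode


def build_select_url(base_url, keywords, facet_fields, filters=()):
    """Build a Solr /select URL with faceting and spellcheck enabled."""
    params = [
        ("q", keywords),
        ("wt", "json"),
        ("facet", "true"),
        ("spellcheck", "true"),
    ]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    # Each fq (filter query) narrows the result set, e.g. after a user
    # clicks on a facet value.
    params += [("fq", fq) for fq in filters]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr/drupal",  # hypothetical core URL
    "performance",
    facet_fields=["bundle", "is_uid"],    # assumed field names
    filters=["bundle:article"],           # assumed facet selection
)
print(url)
```

Repeating facet.field asks Solr for counts on several fields in one request, and each fq value corresponds to a facet the user has already selected.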

The main difference between the two modules is that search_api relies on the Drupal entity API, actually doing a full entity load for each result returned, whereas the apachesolr module retrieves more data from the index when doing a search, meaning that it can provide at least a title and teaser text without having to pull anything in from the database. There are some other small differences, but in general, the options are pretty similar. For our examples in this chapter, we’ll be using the apachesolr module, but the majority of the Solr setup remains exactly the same for either module.

Solr Configuration

Solr has a number of configuration files, many of which define settings that both the server and the client need to be aware of. For that reason, both the apachesolr and search_api modules include a set of configuration files that need to be used on the Solr server. The configuration files included in the module may vary over time, but the most important files included there are:

solrconfig.xml
This file includes Solr-specific configuration settings. You may need to tweak some of these values, but the provided config works well for the general case.
schema.xml
This file defines the schema used when storing documents. Since this defines which fields will be used for searching and storing, it’s important that whatever is used on your server matches what the module expects.
protwords.txt
This file lists words that are protected from “stemming.” Stemming reduces a word to its root form so that variants of the word match each other in a search; for example, “testing” and “tested” would normally both be reduced to “test.” A word listed in protwords.txt is left exactly as written instead. The apachesolr module includes some HTML entities here in order to keep those from being stemmed.
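
The effect of a protected-words list can be illustrated with a toy example. The sketch below is not Solr's stemmer; it is a deliberately simple suffix stripper (the suffix list and the function name toy_stem are made up for illustration) that shows how a word listed as protected passes through unchanged.

```python
SUFFIXES = ("ing", "ed", "er", "s")  # toy suffix list, not a real stemmer


def toy_stem(word, protected=frozenset()):
    """Strip one common suffix unless the word is in the protected set."""
    if word in protected:
        # Protected words are left exactly as written.
        return word
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["testing", "tested", "tester"]
print([toy_stem(w) for w in words])
print([toy_stem(w, protected={"tester"}) for w in words])
```

Without protection, all three words collapse to the same root and therefore match each other; with "tester" protected, it keeps its own form while the others are still reduced.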

Indexing Content

Solr contains its own search index that is totally separate from your site’s main content stored in MySQL. This means that an indexing job is required in order to load content out of the database and push it into the Solr search index. The apachesolr module tracks information on which content needs to be indexed in its own SQL table, and by default it will index a small number (currently the default is 50) of those items each time that the site cron job runs. There are also admin options and drush commands used to run the indexing job outside of the main cron run, to mark content for reindexing, and even to fully delete the Solr index.

Be aware that if you have a large amount of content on your site, it can take a long time (from a few up to even dozens of hours) to index it all the first time. Once content is indexed initially, only new and updated content will be indexed, so it happens much faster. The bottleneck here is generally not Solr, but the fact that the apachesolr module needs to do a full node_load() on content in order to pull out the information needed for the index.

Most sites can support indexing of many more than the default 50 items at a time, and you can usually improve the indexing time by increasing $conf['apachesolr_cron_limit'] to something higher. Depending on the database and network infrastructure specifics, we’ve found the sweet spot for this setting to be somewhere from 100 to 1,000 items. Try a few different values in that range and see which performs best in your environment.
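
A quick back-of-the-envelope calculation shows why the batch size matters when indexing is driven by cron alone. The sketch below is a rough model under stated assumptions: the site size and cron interval are hypothetical, and it assumes each cron run finishes its batch before the next run starts.

```python
import math


def cron_runs_needed(total_items, items_per_run):
    """Number of cron runs required to index every item once."""
    return math.ceil(total_items / items_per_run)


def hours_to_index(total_items, items_per_run, minutes_between_runs):
    """Elapsed hours, assuming each run completes within its interval."""
    runs = cron_runs_needed(total_items, items_per_run)
    return runs * minutes_between_runs / 60


# Hypothetical site: 200,000 nodes, cron running every 15 minutes.
for limit in (50, 200, 1000):
    print(limit, hours_to_index(200_000, limit, 15))
```

With these assumed numbers, the default of 50 items per run would need 1,000 hours of elapsed time, while 1,000 items per run would need 50. This is why large sites usually raise the limit, run the indexing job outside of the main cron run, or both.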

Warning

Solr indexing can be very resource intensive for a web node. Verify that your PHP memory limit and maximum execution time are high enough to avoid errors when indexing.

Infrastructure Considerations

Solr, as of version 1.4, has built-in replication that makes setting up multiple slave servers very easy. Because of that, there’s really no excuse not to run at least two Solr servers in a master/slave setup. Most sites don’t need full high availability for Solr writes, so they are usually fine with failing over to a read-only slave as needed. Full HA with write failover is a bit harder to configure, though Solr 4.x attempts to address some of those shortcomings (leveraging ZooKeeper).

Currently, Solr versions 3.x and 4.x are supported by the Apache Lucene project (and either will work with Drupal). A lot of sites are currently using 3.x versions because they don’t need the new features from 4.x, and the 4.x releases are young enough that some bugs are still being worked out.

When you download Solr, it includes a “built-in” Jetty server (a Java servlet engine with a built-in web server). This works fine for testing, but it’s not designed to be used in a production environment. The most popular options for a production-ready servlet engine are either using the full Jetty distribution, or using Tomcat. Either of those options will work fine with Solr, so it’s really just a matter of preference which you choose.

Solr Replication

We recommend starting with at least two Solr servers set up as master/slave. This will allow you to support at least read-only failover in the case that the master goes offline. The reason for having two servers is more about providing a failover than it is about load balancing search queries, but there’s no reason you can’t benefit from both. There are various choices for a failover mechanism; we often use IP failover controlled with Heartbeat. If you also set up Varnish on the Solr servers to direct traffic, it can easily be configured to send “write” queries only to the master server:

sub vcl_recv {
  if (req.url ~ "^/solr/[^/]+/(select|admin/ping)") {
    set req.backend = solr_server_pool;
  } else {
    set req.backend = solr_master;
    if (req.request == "POST") {
      return (pipe);
    }
    return (pass);
  }

  // rest of vcl_recv...

In this example, we send Solr select queries and requests for admin/ping to a Varnish director containing a pool of all Solr servers. All other traffic is forced to go to the Solr master server only. In the case of POST requests (index updates), we use a pipe in order to avoid timeouts should the update queries take a long time.
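
If you want to sanity-check which requests end up where before deploying the VCL, the same decision can be mirrored in a few lines of Python. This is only an illustrative model of the logic above, not something Varnish runs; the function name route and the sample URLs are made up for the example.

```python
import re

# Same pattern as the VCL: select queries and admin/ping on any core.
READ_PATTERN = re.compile(r"^/solr/[^/]+/(select|admin/ping)")


def route(method, url):
    """Return (backend, action) mirroring the vcl_recv logic."""
    if READ_PATTERN.search(url):
        # Reads go to the pool; the VCL then falls through to the
        # rest of vcl_recv, so no action is decided here.
        return ("solr_server_pool", None)
    # Everything else must reach the master.
    action = "pipe" if method == "POST" else "pass"
    return ("solr_master", action)


print(route("GET", "/solr/drupal/select?q=performance"))
print(route("POST", "/solr/drupal/update"))
print(route("GET", "/solr/drupal/admin/luke"))
```

Note that the match is on the URL only, so a select query sent as a POST would still be routed to the pool, exactly as in the VCL.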

Enabling the built-in replication in Solr is fairly straightforward. If you set a few variables in the file solrcore.properties located within a Solr core’s conf/ directory, they can then be referenced in solrconfig.xml within each Solr core configuration. Handling the configuration with variables and conditionals makes it possible for both master and slave to share the same solrconfig.xml file.

The following replication snippet from solrcore.properties shows the settings for the master server:

solr.replication.master=true
solr.replication.slave=false
solr.replication.pollInterval=00:00:60
solr.replication.masterUrl=http://solr-master-hostname:8112/solr

For the slave, simply swap the true/false values for the master and slave settings. The replication interval can be adjusted if you need to ensure that the slave receives updates faster than once per minute.
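
Because the master and slave files differ only in those two flags, they are easy to generate from a single template, for example from a deployment script. The following is a minimal sketch of that idea; the function name replication_properties is hypothetical, and the property names and values are the ones shown in the snippet above.

```python
def replication_properties(role, master_url, poll_interval="00:00:60"):
    """Render the replication lines of solrcore.properties for one server."""
    if role not in ("master", "slave"):
        raise ValueError("role must be 'master' or 'slave'")
    is_master = role == "master"
    lines = [
        # Exactly one of these two flags is true on any given server.
        f"solr.replication.master={str(is_master).lower()}",
        f"solr.replication.slave={str(not is_master).lower()}",
        f"solr.replication.pollInterval={poll_interval}",
        f"solr.replication.masterUrl={master_url}",
    ]
    return "\n".join(lines) + "\n"


print(replication_properties("slave", "http://solr-master-hostname:8112/solr"))
```

Generating both files from one function keeps the master URL and poll interval identical across servers, so only the role ever differs.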

With those settings in place, the default solrconfig.xml that ships with the apachesolr module will handle replication for you, using the solr.replication.master and solr.replication.slave variables to conditionally enable master or slave behavior on each of the servers.

Drupal Module Installation

Download and install the apachesolr module. The module’s project page lists some additional modules that may be desired for extra features. There’s no need to enable the module just yet, since we need to get Solr set up first.

Inside the module, you’ll find a directory, solr-conf/, that contains Solr configuration files for different versions of Solr: 1.4, 3.x, and 4.x. Any new installation should use 3.x or 4.x, as 1.4 is already quite dated. Depending on which Solr version you decide to use, copy the configuration files from the corresponding directory to your Solr server. They should be placed in /<path_to_solr>/<corename>/conf/.

Once you’ve copied those in place, you’ll need to (re)start your Solr service (e.g., Jetty or Tomcat) in order to recognize the configuration changes. Once that is running, you should be able to enable the apachesolr module and update its settings to point to your Solr URL. It will report if it’s able to successfully connect to the Solr server, and from that point, you can start indexing your content in Solr.

With the Solr server successfully up and running, the remaining module setup and configuration is fairly easy. Instead of covering it here, where it may quickly go out of date, we recommend reading through the apachesolr module’s project page and documentation, as well as the README.txt file that ships with the module.
