Filed under analytics, learning.

Content recommendations, or “related links”, are a common feature on content-rich sites. For example, while editing this post I am being presented with a set of “related content” links provided by a partner of WordPress. These may be driven editorially, by search, by linguistic document similarity measures, or really by any relationship among content items. Editorial recommendations can be useful, since they tend to be well-informed, but they are difficult to maintain on a large site; they aren’t really practical when there are more than a few hundred items on offer. Recommendations based on document similarity, which may be driven by subject classifications or other metadata, or even full-text statistics, scale up better to larger document sets, but tend to offer low-relevance suggestions without careful tuning.

We like recommendations based on observed user behavior. Our catchy buzz-phrase for it is “people who read what you’re reading also read this other thing — hey you might like it too!” (Maybe marketing will come up with something buzzier with “social” in the name somewhere.) As long as a site is sufficiently active, usage-based recommendations will tend to have a high perceived relevance, probably since they reflect real human preferences.

There is a huge literature in this field, as a search for “content recommendation” or “collaborative filtering” will quickly uncover. We won’t attempt to survey it here. In this article we examine a very simple implementation, look at some sample data, discuss a few interesting scaling problems it raises, and propose solutions to them.

Implementation Zero

The idea of user-based recommendations is to record an association between content items whenever a single user performs an action that indicates an interest in both items. Depending on the site, this might be based on purchase behavior, on search behavior, or on user ratings and/or comments. In our system — which is really primarily a content-delivery site, rather than a transactional site — we record an association whenever two content items are viewed by the same user in the same session.

We were already recording every request for content in a SQL table, in order to drive customer usage reports. This request table contains a unique session id (we used the JSESSIONID for this J2EE-based site) and a content id (uri). That is the essential information required to drive the recommendation feature. There are some implementation details here: do you record the usage directly from the application (if so, beware of table locking!), or in a batch process by analyzing logs (in which case, make sure your URLs are neat and clean, or be handy with regular expressions)? But those are a topic for another post.
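For reference, a minimal version of such a request table might look like the sketch below. The column names are illustrative assumptions, not our actual schema, and a real table would carry more reporting columns (timestamps, customer ids, and so on).

    -- Illustrative sketch only: a minimal request log.
    -- Column names are assumed; the real table has additional reporting columns.
    create table request (
      session_id varchar(64),    -- e.g. the JSESSIONID cookie value
      uri        varchar(255),   -- the content id of the requested document
      index request_session_idx (session_id),
      index request_uri_idx (uri)
    );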

We have the raw data we need, and we want to write a SQL query that pulls out recommendations from that. But there really is no efficient way to write that query with the data in its raw form. We had one truly horrible implementation coded up by a contractor using Hibernate, who came up with something like this (actually this is much neater and simpler than the Hibernate-generated query, but it exhibits the same scaling characteristics):

 select * from (
      select r2.uri, count(*) as freq
        from request r1, request r2
       where r1.session_id = r2.session_id   -- same session; column name per your request table
         and r1.uri = ?uri?
         and r2.uri <> ?uri?
       group by r2.uri
    ) pair_counts
    order by freq desc
    limit 5

That should give us the result we want. It selects the documents most often viewed in the same session as the current document (the value of the ?uri? parameter). But it is horribly inefficient: the query time grows as the square of the number of document views. Over time, the number of document views grows without bound; recommendations should improve as we include more user data, so we want to include lots of data, but we would really prefer a query that doesn’t get slower as the quality of the recommendations increases!

Implementation One

As with so many database query performance problems, the solution here is to denormalize the data. If we precompute the association between each pair of documents, we can very quickly select the most related documents. We define the related_document table as follows:

create table related_document (
  u1 int,      -- id of the first document in the pair (u1 < u2)
  u2 int,      -- id of the second document in the pair
  freq int,    -- number of sessions in which both documents were viewed
  fnorm int,   -- freq normalized by document popularity (see below)
  index related_document_fnorm_idx (fnorm),
  index related_document_uri_idx (u1, u2),
  index related_document_uri2_idx (u2, u1)
);

u1 and u2 reference document uris; u1 is always less than u2, since we don’t want to store both sides of the symmetric relation. freq stores the number of times the two uris, u1 and u2, are found to have been viewed in the same session. fnorm is freq normalized by the documents’ overall popularity, so that it can be interpreted as a probability (we come back to exactly why below).
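For completeness, here is a rough sketch of the batch step that could populate related_document from the raw request table. It is a hedged sketch, not our production job: it assumes the request table columns from the earlier sketch, assumes uris have been mapped to the integer ids stored in u1 and u2, rebuilds the table from scratch rather than updating it incrementally, and leaves fnorm to a second pass.

    -- Rebuild the pair counts from the request log (MySQL-flavored sketch).
    -- Assumes request.uri holds (or has been mapped to) the integer ids used in u1/u2.
    truncate table related_document;

    insert into related_document (u1, u2, freq, fnorm)
    select r1.uri as u1, r2.uri as u2, count(*) as freq,
           0 as fnorm                      -- normalized score filled in by a second pass
      from request r1
      join request r2
        on r1.session_id = r2.session_id   -- viewed in the same session
       and r1.uri < r2.uri                 -- store each unordered pair only once
     group by r1.uri, r2.uri;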
Given data in this form, we can select related document recommendations with this simple query:

    select u1, u2 from related_document where u1=?uri? or u2=?uri?
     order by fnorm desc limit 5

Given the indexes defined above, the query will execute very quickly, perhaps in constant time (if the index is a hash index) or in log time (if it is some kind of tree-based index). We even have a covering index, so rows never need be retrieved from the table. Note that it is necessary to index both (u1,u2) and (u2,u1): we only store each uri pair in a single order, but we want to retrieve all pairs that include uri, and it may appear on either side.

And in case it wasn’t clear why we compute the normalized score, fnorm: in short, it can be interpreted as the probability that a random person who viewed document u1 also viewed document u2 (in the same session). If we used the raw freq score to make recommendations, we would tend to over-recommend popular documents. Consider an extreme case: say there is some document that everyone views, for some reason. This document will have the highest freq score relation with every other document, and it will always be the first document recommended on every page. Clearly it’s not interesting to recommend a document to people who have almost certainly seen it already.
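To make that concrete with some made-up numbers: suppose a near-ubiquitous document H appears in 10,000 sessions and co-occurs with the current document in 500 of them, while a niche document N appears in only 600 sessions and co-occurs with the current document in 300 of them. Ranked by raw freq, H wins (500 > 300), but the normalized scores tell the opposite story: only 5% of the sessions that included H also included the current document, versus 50% of the sessions that included N, so N is the stronger, more informative recommendation.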

Meanwhile, problems persist in the space-time continuum

You may have noticed a problem with the approach we sketched out above. Remember that we said the original query was borked because it had an N^2 slowdown? Well, all we’ve really done here is take a time problem and turn it into a space problem. Because our related_document table stores a row for every (viewed) pair of documents, it can potentially store N(N-1)/2 rows, where N is the number of documents. If our site has one million documents, this table could in theory have five hundred billion rows, which probably won’t work.

In reality, since we don’t store rows with zero counts, and many (probably most) document pairs will never occur together, the problem isn’t quite that bad. In fact, since we process the statistics in batches as they come in from the site, we have a rough sense of the growth rate of the pair statistics relative to the number of requests. What we observed is approximately linear growth (roughly 20 new pairs per request), rather than growth as the square of the request count, which presumably reflects the strong relationships among the content requests (people interested in hermeneutics read hermeneutics articles, nanotechnology not so much). After recording 500K requests, we had 10M distinct request pairs, requiring about 1.2 GB of file space. This growth tracks the number of recorded requests, not documents, and the number of requests will grow without bound. We’d like a reasonable way to bound the growth; we don’t want our site to go down because the database server ran out of disk space!

The good news is that our space problem is more tractable than our time problem was. It turns out that the distribution of content requests on web sites tends to follow a power law (also called a Zipf or Pareto distribution, among other fancy names). Essentially, not all documents are created equal: some are much more popular than others. Because of this, the “probability mass” in the related_document table is distributed very unevenly. The idea we pursued is to use this unevenness to guide us in pruning rows from the table, keeping it at a manageable size while still keeping the most valuable statistics. Basically, we should be able to sort the related_document table by freq, or by fnorm, and truncate it after some fixed number of rows that we choose. What we’d like to know is: how many rows should we keep? And should we sort by freq, or by fnorm? To answer these questions, let’s look at some data we captured from actual usage.

First-order statistics

This site has around 10M unique content URIs, and the data samples shown here were drawn from just over 10M requests for content. There are 1.2M distinct URIs in the sample. The following graphs demonstrate the uneven distribution of requests.

Figure 1. Content request distribution


Shows the number of requests per uri, with the uris sorted in decreasing frequency order. The distribution approximates 1/N. (The most and least frequent uris are not shown; at this scale those points would lie on the axes.)

Figure 2. Content request cumulative distribution


Shows the same underlying data as Figure 1, but as a cumulative distribution: for any given content item, the graph shows the number of requests for that item and all more popular items.

The graphs show only a portion of a smaller sample, but they give the flavor of all of these sorts of distributions. Some quick calculations give a more precise characterization of the distribution for a sample of 1M requests. There were 340K distinct URIs. The most-requested item had 23,000 requests. The cumulative distribution reaches 80% of all usage at rank 164,000, so it includes about half of the requested content. This is not an 80/20 distribution; it has a much fatter tail.

Second-order statistics

Our main interest here is to ensure that we provide quality recommendations to our users. To do this efficiently for a large number of documents, we need to be able to prune the document-pair statistics, which tends to degrade the recommendations because it throws away useful information. These considerations lead us to seek a numerical measure of recommendation quality that we can easily maximize. With that in hand, we can find the least destructive pruning that still keeps our data size below some reasonable bound. Basically, we want to be able to sort the data by some metric and trim off part of the tail without losing too much recommendation goodness.

We devised some metrics that are supposed to indicate the amount of useful information contained in various subsets of the data. These are all based on the distribution of uris over request-pairs, in a few different forms. One idea is to maximize the number of content items whose pages will have recommendations on them. Figure 3 below shows this. In particular, it accumulates the number of distinct content items included in the pairs, with the pairs sorted in decreasing frequency order. This metric shows the degree to which we will have “covered all bases”: it’s ideal for demonstrating success to the OCD product manager. In the sample shown below, 82% of all content is covered by the top 10% of pairs. You can see this by noting that the total number of distinct uris is about 55K, that 10% of the total number of pairs (1.2M) is 120K, and that eyeballing the graph shows about 45K uris covered by the top 120K pairs, roughly the 82% quoted above. The distribution is more heavily front-loaded than the first-order statistics, which we’d expect since it should behave something like sum(1/n^2) (the square of the first-order graph).

Figure 3. Content distribution over pairs, ranked by frequency

[Graph: the curve rises steeply and then levels off, with some bumps, like the knee of a seated person with a huge lap.]

count(distinct uri) in { (u1,u2) where rank(freq(u1,u2)) <= n }
The graph shows the number of distinct uris covered in a sample of 500K requests including about 55K unique uris.
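For a single cutoff, the coverage number behind Figure 3 can be computed directly from the pair table. The sketch below assumes a database with common table expression support (e.g. MySQL 8); 120000 corresponds to the top 10% of the roughly 1.2M pairs in this sample.

    -- How many distinct documents appear somewhere in the top-n pairs, ranked by freq?
    -- Requires CTE support (e.g. MySQL 8); adjust the limit to your own cutoff.
    with top_pairs as (
      select u1, u2
        from related_document
       order by freq desc
       limit 120000                -- roughly the top 10% of pairs in this sample
    )
    select count(distinct uri) as covered_uris
      from (select u1 as uri from top_pairs
            union all
            select u2 as uri from top_pairs) all_uris;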

A more refined notion is to be concerned, not so much about how many content items are covered, but about how many page requests are covered. To do this, we measure probability mass rather than raw frequency. This has the effect of weighting more popular content more heavily, and makes a bit more sense if you think about user satisfaction: more requests for related content will be satisfied at some given percentage level using this metric than with the previous one. Figure 4 shows the distribution of the content probability mass over recommendation pairs. In this distribution, 79% of all requests are covered by the top 10% of pairs. The distribution is very similar to the previous one, indicating it’s probably not worth concerning ourselves with this fine distinction.

Figure 4. Content probability distribution over pairs, ranked by frequency

[Graph: knee play; another knee-shaped curve.]

sum(freq(distinct uri)); uri in { (u1,u2) where rank(freq(u1,u2)) <= n }

The graph shows the number of requests covered in a sample of 500K requests.

In the previous charts, we didn’t take into account how valuable the recommendations are. It’s one thing to pat ourselves on the back for recommending *something* on every page, or on as many requests as we can, but if those recommendations are of low quality, we can’t really claim to have satisfied anyone by putting them out there. As we discussed above, the normalized pair frequencies provide a measure of how related two content items are, independent of the popularity of those items, and are therefore a measure of the value of the recommendation. The next two diagrams show the same data as the previous two, with the pairs ranked by probability (fnorm) rather than raw frequency. Pruning according to this metric will allow us to retain the most accurate recommendations, as opposed to recommendations for the most popular content (which could create an unrealistic positive feedback loop). I guess these are metrics for connoisseurs.

Ideally, we would like to preserve the “best” recommendations, and at the same time cover as many page requests as possible. The good news is that this distribution is not terribly different from the previous one: 80% of uris covered by the top 10% of pairs, ordered by probability. So we can choose to preserve “high-quality” recommendations without sacrificing coverage.

Figure 5. Content distribution over pairs, ranked by probability

[Graph: looks like a knee, with some bumps.]

count(distinct uri) in { (u1,u2) where rank(fnorm(u1,u2)) <= n }

Again, we see the request distribution is very similar to the unique uri distribution; here we get 79% of requests covered by the top 10% of pairs.

Figure 6. Request distribution over pairs, ranked by probability

[Graph: another knee-like curve.]

sum(freq(uri)); uri in { (u1,u2) where rank(fnorm(u1,u2)) <= n }

Managing Growth

A key measurement to make is how large the database will grow. This is to some extent a disk space consideration, but more importantly a memory consideration. We want to be able to maintain all database indexes in memory for best performance, so it’s important to track the index sizes.

Increase in database size as new requests are added, in 100K batches. Red bars show deltas; the blue line shows the total size.

We added data in increments of 100,000 requests and watched the size of the resulting related_document table. For the first several batches, the table grew by about 2M rows per batch. In later batches, starting at 500K total requests, the increments shrank and the total began to level off somewhat, probably reflecting the fact that the pairs which actually occur are fairly sparse. Still, we don’t see any real limit to the size yet, so truncating the data in a rational way remains necessary in order to manage growth, provide the best suggestions possible, and maintain a predictable resource profile.
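If you are on MySQL (as the create table syntax above suggests), one hedged way to track this is to read the approximate data and index footprint of the pair table from information_schema; the numbers are estimates and depend on the storage engine.

    -- Approximate size of the pair table and its indexes, in megabytes (estimates only).
    select table_name,
           round(data_length  / 1024 / 1024) as data_mb,
           round(index_length / 1024 / 1024) as index_mb,
           table_rows
      from information_schema.tables
     where table_name = 'related_document';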


The decision about how much data to keep ultimately depends on the resources available and the relative importance of the recommendation feature. Ideally we would just keep all the data. But if resources are constrained, ranking request-pairs by their probability “value” and keeping only the top portion of the distribution is a good way to maintain quality recommendations and good coverage; ranking by frequency gives approximately the same results. Truncating the data to the top 10% maintained about 80% coverage in our sample. We have given some rough measurements of what sizes to expect for different numbers of documents and requests, but of course you should measure with your own data.
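Mechanically, the truncation itself can be very simple. As a sketch (again MySQL-flavored), the lowest-value rows can be deleted in place after each batch load; the 100000 below is just an illustrative batch size, and you could order by freq instead.

    -- Sketch: prune the weakest pairs to keep the table within a size budget.
    -- Repeat (or raise the limit) until the table is back under its target row count.
    delete from related_document
     order by fnorm asc, freq asc
     limit 100000;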

Tags: collaborative filtering, content recommendation, data, SQL, statistics

2 Responses to “Recommending content from user data”

  1. Tom Beyer

    It’s really good – quite fascinating. It would be helpful to align the text and the graphs better.

    At one point you mention that the distribution isn’t a straight 80/20 rule – that you were seeing a fatter tail – but in the end you seem to be truncating at 10%, which gives about 80% coverage. Is there a reason for picking 10%? What does it take to get 85% or 90% coverage? Conversely, what does truncating at 5% or 8% get you? Is 10% the point of maximum benefit with least expenditure? Should I be seeing this in the graphs you provided but I’m just not understanding them?

  2. Michael Sokolov

    Thanks for your feedback, Tom. I’m not sure what to do about the text/graph alignment – WordPress baffles me a bit, but I’ll see if I can improve that. I’ve edited the text to clarify the point about the distributions, I hope. The 10% point was an arbitrary point I chose since it gave about 80% coverage *of the pair statistics*. The distribution of pairs (second order stats) is something related to the square of the first order (document) statistics, so there are two different kinds of distributions I’m talking about.