Chapter 4. Common Reputation Models

Now we’re going to start putting our simple reputation building blocks from Chapter 3 to work. Let’s look at some actual reputation models to understand how the claims, inputs, and processes described in the last chapter can be combined to model a target entity’s reputation.

In this chapter, we name and describe a number of simple and broadly deployed reputation models, such as vote to promote, simple ratings, and points. You probably have some degree of familiarity with these patterns by simple virtue of being an active online participant. You see them all over the place; they’re the bread and butter of today’s social web. Later in this chapter, we show you how to combine these simple models and expand upon them to make real-world models.

Understanding how these simple models combine to form more complete ones will help you identify them when you see them in the wild. All of this will become important later in the book, as you start to design and architect your own tailored reputation models.

Simple Models

At their very simplest, some of the models we present next are really no more than fancified reputation primitives: counters, accumulators, and the like. Notice, however, that just because these models are simple doesn’t mean that they’re not useful. Variations on the favorites-and-flags, voting, ratings-and-reviews, and karma models are abundant on the Web, and the operators of many sites find that, at least in the beginning, these simple models suit their needs perfectly.

Favorites and Flags

The favorites-and-flags model excels at identifying outliers in a collection of entities. The outliers may be exceptional either for their perceived quality or for their lack of same. The general idea is this: give your community controls for identifying or calling attention to items of exceptional quality (or exceptionally low quality).

These controls may take the form of explicit votes for a reputable entity, or they may be more subtle implicit indicators of quality (such as the ability to bookmark content or send a link to it to a friend). A count of the number of times these controls are accessed forms the initial input into the system; the model uses that count to tabulate the entities’ reputations.

In its simplest form, a favorites-and-flags model can be implemented as a simple counter (Figure 4-1). When you start to combine them into more complex models, you’ll probably need the additional flexibility of a reversible counter.

Figure 4-1. Favorites, flags, or send-to-a-friend models can be built with a Simple Counter process—count ’em up and keep score.
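
In code, such a counter is only a few lines. Here is a minimal sketch in Python; the class and method names, and the one-bookmark-per-user policy, are our assumptions rather than any particular site’s implementation:

    class ReversibleCounter:
        """Counts favorite/flag/send-to-a-friend events and supports reversal."""

        def __init__(self):
            self.sources = set()   # remember who counted, so we can undo later

        def increment(self, source_id):
            self.sources.add(source_id)       # at most one count per user

        def reverse(self, source_id):
            self.sources.discard(source_id)   # e.g., the user removes a bookmark

        @property
        def count(self):
            return len(self.sources)

    favorites = ReversibleCounter()
    favorites.increment("user_42")   # a user bookmarks the item
    favorites.reverse("user_42")     # the user un-bookmarks; the count rolls back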

The favorites-and-flags model has three variants: vote to promote, favorites, and report abuse.

Vote to promote

The vote-to-promote model, a variant of the favorites-and-flags model, has been popularized by crowd-sourced news sites such as Digg, Reddit, and Yahoo! Buzz. In a vote-to-promote system, a user promotes a particular content item in a community pool of submissions. This promotion takes the form of a vote for that item, and items with more votes rise in the rankings to be displayed with more prominence.

Vote to promote differs from this-or-that voting (see the section This-or-That Voting) primarily in the degree of boundedness around the user’s options. Vote to promote enacts an opinion on a reputable object within a large, potentially unbounded set (a site like StumbleUpon, for instance, has the entire Web as its candidate pool of potential objects).

Favorites

Counting the number of times that members of your community bookmark a content item can be a powerful method for tabulating content reputation. This method provides a primary value (see the sidebar Provide a Primary Value) to the user: bookmarking an item gives the user persistent access to it, and the ability to save, store, or retrieve it later. And, of course, it also provides a secondary value to the reputation system.

Report abuse

Unfortunately, there are many motivations in user-generated content applications for users to abuse the system. So it follows that reputation systems play a significant role in monitoring and flagging bad content. This is not that far removed from bookmarking the good stuff. The most basic type of reputation model for abuse moderation involves keeping track of the number of times the community has flagged something as abusive. Craigslist uses this mechanism, setting a custom threshold for each item listed on a per-user, per-category, and even per-city basis—though the value and the formulation are always kept secret from the users.

Typically, once a certain threshold is reached, the application either routes the content to human agents (staff) for action or applies logic to determine the proper automated outcome: remove the offending item, properly categorize it (for instance, add an adult content disclaimer to it), or add it to a prioritized queue for human agent intervention.
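
To make that flow concrete, here is one hedged sketch of the logic in Python. The threshold value, function names, and outcome are all illustrative—real systems, Craigslist included, keep their actual values and formulations secret:

    ABUSE_THRESHOLD = 5   # hypothetical; real thresholds are secret and contextual

    def report_abuse(item_id, reporter_id, abuse_reports):
        """Record one abuse flag per user; act when the threshold is crossed."""
        abuse_reports.setdefault(item_id, set()).add(reporter_id)
        if len(abuse_reports[item_id]) >= ABUSE_THRESHOLD:
            queue_for_review(item_id, priority=len(abuse_reports[item_id]))

    def queue_for_review(item_id, priority):
        print(f"queued {item_id} for staff review at priority {priority}")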

Tip

If your application is at a scale where automated responses to abuse reports are necessary, you’ll probably want to consider tracking reputations for abuse reporters themselves. See Who watches the watchers? for more.

This-or-That Voting

If you give your users options for expressing their opinion about something, you are giving them a vote. A very common use of the voting model (which we’ve referenced throughout this book) is to allow community members to vote on the usefulness, accuracy, or appeal of something.

To differentiate from more open-ended voting schemes like vote to promote, it may help to think of these types of actions as this-or-that voting: choosing from the most attractive option within a bounded set of possibilities (see Figure 4-2).

It’s often more convenient to store that reputation statement back as a part of the reputable entity that it applies to, making it easier, for example, to fetch and display a Was this review helpful? score (see Figure 2-7 in Chapter 2).

Figure 4-2. Those Helpful Review scores that you see are often nothing more than a Simple Ratio.
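
Reduced to a sketch, the whole mechanism is two counters and a ratio. Everything here, names included, is illustrative:

    class HelpfulRatio:
        """The Simple Ratio behind a 'Was this review helpful?' score."""

        def __init__(self):
            self.positive = 0
            self.total = 0

        def vote(self, helpful):
            self.total += 1
            if helpful:
                self.positive += 1

        def display(self):
            return f"{self.positive} out of {self.total} found this review helpful"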

Ratings

When an application offers users the ability to express an explicit opinion about the quality of something, it typically employs a ratings model (Figure 4-3). There are a number of different scalar-value ratings: stars, bars, HotOrNot, or a 10-point scale. (We discuss how to choose from among the various types of ratings inputs in the section Determining Inputs.) In the ratings model, ratings are gathered from multiple individual users and rolled up as a community average score for that target.

Figure 4-3. Individual ratings contribute to a community average.
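
A community average is just a running sum and count, kept reversible so that a revised or revoked rating can be backed out later. A minimal sketch follows (our names; persistence and the bias and liquidity corrections discussed later in this chapter are omitted):

    class ReversibleAverage:
        """Rolls individual ratings up into a community average."""

        def __init__(self):
            self.sum = 0.0
            self.count = 0

        def add(self, rating):
            self.sum += rating
            self.count += 1

        def remove(self, rating):   # reverse an earlier contribution
            self.sum -= rating
            self.count -= 1

        @property
        def average(self):
            return self.sum / self.count if self.count else None

    stars = ReversibleAverage()
    stars.add(4)
    stars.add(5)       # community average is now 4.5
    stars.remove(5)    # the 5-star rater revises or deletes her rating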

Reviews

Some ratings are most effective when they travel together. More complex reputable entities frequently require more nuanced reputation models, and the ratings-and-review model, shown in Figure 4-4, allows users to express a variety of reactions to a target. Although each rated facet could be stored and evaluated as its own specific reputation, semantically that wouldn’t make much sense; it’s the review in its entirety that is the primary unit of interest.

In the reviews model, a user gives a target a series of ratings and provides one or more freeform text opinions. Each individual facet of a review feeds into a community average.

Figure 4-4. A full user review typically is made up of a number of ratings and some freeform text comments. Those ratings with a numerical value can, of course, contribute to aggregate community averages as well.

Points

For some applications, you may want a very specific and granular accounting of user activity on your site. The points model, shown in Figure 4-5, provides just such a capability. With points, your system counts up the hits, actions, and other activities that your users engage in and keeps a running sum of the awards.

Figure 4-5. As a user engages in various activities, they are recorded, weighted, and tallied.
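
In sketch form, a points model is a weighted accumulator. The actions and point values below are invented for illustration—and, as the dangers listed next make clear, choosing them is the hard part:

    POINT_VALUES = {            # hypothetical weights; yours will differ
        "leave_comment": 1,
        "upload_photo": 5,
        "write_review": 10,
    }

    def award_points(scores, user_id, action):
        """Record, weight, and tally one user activity."""
        scores[user_id] = scores.get(user_id, 0) + POINT_VALUES.get(action, 0)

    scores = {}
    award_points(scores, "user_42", "write_review")
    award_points(scores, "user_42", "leave_comment")   # scores["user_42"] == 11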

This is a tricky model to get right. In particular, you face two dangers:

  • Tying inputs to point values almost forces a certain amount of transparency into your system. It is hard to reward activities with points without also communicating to your users what those relative point values are. (See Keep Your Barn Door Closed (but Expect Peeking).)

  • You risk unduly influencing certain behaviors over others: it’s almost certain that some minority of your users (or, in a success-disaster scenario, the majority of your users) will make points-based decisions about which actions they’ll take.

Caution

There are significant differences between points awarded for reputation purposes and monetary points that you may dole out to users as currency. The two are frequently conflated, but reputation points should not be spendable.

If your application’s users must actually surrender part of their own intrinsic value in order to obtain goods or services, you will be punishing your best users, and you’ll quickly lose track of people’s real relative worth. Your system won’t be able to tell the difference between truly valuable contributors and those who are just good hoarders and never spend the points allotted to them.

It would be far better to link the two systems but allow them to remain independent of each other: a currency system for your game or site should be orthogonal to your reputation system. Regardless of how much currency changes hands in your community, each user’s underlying intrinsic karma should be allowed to grow or decay uninhibited by the demands of commerce.

Karma

A karma model is reputation for users. In the section Solutions: Mixing Models to Make Systems, we explained that a karma model usually is used in support of other reputation models to track or create incentives for user behavior. All the complex examples later in this chapter (Combining the Simple Models) generate and/or use a karma model to help calculate a quality score for other purposes, such as search ranking, content highlighting, or selecting the most reputable provider.

There are two primitive forms of karma models: models that measure the amount of user participation and models that measure the quality of contributions. When these types of karma models are combined, we refer to the combined model as robust. Including both types of measures in the model gives the highest scores to the users who are both active and produce the best content.

Participation karma

Counting socially and/or commercially significant events by content creators is probably the most common type of participation karma model. This model is often implemented as a point system (see the earlier section Points), in which each action is worth a fixed number of points and the points accumulate. A participation karma model looks exactly like Figure 4-5, where the input event represents the number of points for the action and the source of the activity becomes the target of the karma.

There is also a negative participation karma model, which counts how many bad things a user does. Some people call this model strikes, after the three-strikes rule of American baseball. Again, the model is the same, except that the application interprets a high score inversely.

Quality karma

A quality-karma model, such as eBay’s seller feedback model (see eBay Seller Feedback Karma), deals solely with the quality of user contributions. In a quality-karma model, the number of contributions is meaningless unless it is accompanied by an indication of whether each contribution is good or bad for business. The best quality-karma scores are always calculated as a side effect of other users evaluating the contributions of the target.

On eBay, a successful auction bid is the subject of the evaluation, and the results roll up to the seller: if there is no transaction, there should be no evaluation. For a detailed discussion of this requirement, see Karma is complex, built of indirect inputs. Look ahead to Figure 4-6 for a diagram of a combined ratings-and-reviews and quality-karma model.

Figure 4-6. A robust-karma model might combine multiple other karma scores—measuring, perhaps, not just a user’s output (Participation) but his effectiveness (or Quality) as well.

Robust karma

By itself, a participation-based karma score is inadequate to describe the value of a user’s contributions to the community, and we will caution time and again throughout the book that rewarding simple activity is an impoverished way to think about user karma. However, you probably don’t want a karma score based solely on quality of contributions, either. In that case, you may find your system rewarding cautious contributors—ones who, out of a desire to keep their quality ratings high, contribute only to safe topics, or who, once they’ve attained a certain quality ranking, stop contributing to protect it.

What you really want to do is to combine quality-karma and participation-karma scores into one score—call it robust karma. The robust-karma score represents the overall value of a user’s contributions: the quality component ensures some thought and care in the preparation of contributions, and the participation side ensures that the contributor is very active, that she’s contributed recently, and (probably) that she’s surpassed some minimal thresholds for user participation—enough that you can reasonably separate the passionate, dedicated contributors from the fly-by post-then-flee crowd.
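
One hedged way to express that combination in code follows; the 60/40 weighting and the minimum-participation gate are our assumptions for illustration, and (as the next paragraph notes) the right mix depends entirely on your application:

    def robust_karma(quality, participation, min_participation=0.1):
        """Blend normalized quality and participation scores (each 0.0-1.0)."""
        if participation < min_participation:
            return 0.0   # too little activity to judge; withhold karma
        return 0.6 * quality + 0.4 * participation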

The weight you’ll give to each component depends on the application. Robust-karma scores often are not displayed to users, but may be used instead for internal ranking or flagging, or as factors influencing search ranking; see Keep Your Barn Door Closed (but Expect Peeking), later in this chapter, for common reasons for this secrecy. But even when karma scores are displayed, a robust-karma model has the advantage of encouraging users both to contribute the best stuff (as evaluated by their peers) and to do it often.

When negative factors are included in a robust-karma score, the model is particularly useful to customer care staff—both to highlight users who have become abusive or whose contributions decrease the overall value of content on the site, and potentially to provide an increased level of service to proven-excellent users who become involved in a customer service procedure. A robust-karma model helps find the best of the best and the worst of the worst.

Combining the Simple Models

By themselves, the simple models described earlier are not enough to demonstrate a typical deployed large-scale reputation system in action. Just as the ratings-and-reviews model is a combination of the simpler atomic models that we described in Chapter 3, most reputation models combine multiple smaller, simpler models into one complex system.

Caution

We present these models for understanding, not for wholesale copying. If we impart one message in this book, we hope it is this: reputation is highly contextual, and what works well in one context will almost inevitably fail in many others. Copying any existing implementation of a model too closely may indeed lead you closer to the surface aspects of the application that you’re emulating. Unfortunately, it may also lead you away from your own specific business and community objectives.

Part III shows how to design a system specific to your own product and context. You’ll see better results for your application if you learn from models presented in this chapter, then set them aside.

User Reviews with Karma

Eventually, a site based on a simple reputation model, such as the ratings-and-reviews model, is bound to become more complex. Probably the most common reason for increasing complexity is the following progression. As an application becomes more successful, it becomes clear that some of the site’s users produce higher-quality reviews. These quality contributions begin to significantly increase the value of the site to end users and to the site operator’s bottom line. As a result, the site operator looks for ways to recognize these contributors, increase the search ranking value of their reviews, and generally provide incentives for this value-generating behavior. Adding a karma reputation model to the system is a common approach to reaching those goals.

The simplest way to add a quality-karma score to a simple ratings-and-reviews reputation system is to introduce a Was this helpful? feedback mechanism that visiting readers may use to evaluate each review.


The example in Figure 4-7 is a hypothetical product reputation model, and the reviews focus on 5-star ratings in the categories overall, service, and price. These specifics are for illustration only and are not critical to the design. This model could just as well be used with thumb ratings and any arbitrary categories, such as sound quality or texture.

Figure 4-7. In this two-tiered system, users write reviews and other users review those reviews. The outcome is a lot of useful reputation information about the entity in question (here, Dessert Hut) and all the people who review it.

The combined ratings-and-reviews with karma model has one compound input: the review and the was-this-helpful vote. From these inputs, the community rating averages, the WasThisHelpful ratio, and the reviewer quality-karma rating are generated on the fly. Pay careful attention to the sources and targets of the inputs of this model; they are not the same users, nor are their ratings targeted at the same entities.

The model can be described as follows:

  1. The review is a compound reputation statement of claims made by a single source user (the reviewer) about a particular target, such as a business or a product (see the data-structure sketch after this list):

    • Each review contains a text-only comment that typically is of limited length and that often must pass simple quality tests, such as minimum size and spell checking, before the application will accept it.

    • The user must provide an overall rating of the target—in this example, in the form of a 5-star rating, although it could be on any scale appropriate to the application.

    • Users who wish to provide additional detail about the target can contribute optional service and/or price scores. A reputation system designer might encourage users to contribute optional scores by increasing their reviewer quality karma if they do so. (This option is not shown in the diagram.)

    • The last claim included in the compound review reputation statement is the WasThisHelpful ratio, which is initialized to 0 out of 0 and is never actually modified by the reviewer but derived from the was-this-helpful votes of readers.

  2. The was-this-helpful vote is not entered by the reviewer but by a user (the reader) who encounters the review later. Readers typically evaluate a review itself by clicking one of two icons, thumb-up (Yes) or thumb-down (No), in response to the prompt “Did you find this review helpful?”.
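
For concreteness, the compound review statement described in step 1 might be represented like this; the field names are ours, and the optional facets simply default to empty:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Review:
        source_id: str                   # the reviewer
        target_id: str                   # the entity reviewed (here, Dessert Hut)
        comment: str                     # length- and quality-checked freeform text
        overall: int                     # required rating, 1-5 stars
        service: Optional[int] = None    # optional facet
        price: Optional[int] = None      # optional facet
        helpful_yes: int = 0             # the WasThisHelpful ratio starts at
        helpful_total: int = 0           #   "0 out of 0"; only readers change it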

This model has only three processes or outputs and is pretty straightforward. Note, however, the split shown for the was-this-helpful vote, where the message is duplicated and sent both to the Was This Helpful? process and the process that calculates reviewer quality karma. The more complex the reputation model, the more common this kind of split becomes.

Besides indicating that the same input is used in multiple places, a split also offers the opportunity to do parallel and/or distributed processing—the two duplicate messages take separate paths and need not finish at the same time or at all.

  1. The Community Overall Averages process calculates the average of all the component ratings in the reviews. The overall, service, and price claims are averaged. Since some of these inputs are optional, keep in mind that each claim type may have a different total count of submitted claim values.

    Because users may need to revise their ratings and the site operator may wish to cancel the effects of ratings by spammers and other abusive behavior, the effects of each review are reversible. This is a simple reversible average process, so it’s a good idea to consider the effects of bias and liquidity when calculating and displaying these averages (see the section Practitioner’s Tips: Reputation Is Tricky).

  2. The Was This Helpful? process is a reversible ratio, keeping track of the total (T) number of votes and the count of positive (P) votes. It stores the output claim in the target review as the HelpfulScore ratio claim with the value P out of T.

    Policies differ for cases when a reviewer is allowed to make significant changes to a review (for example, changing a formerly glowing comment into a terse This sucks now!). Many site operators simply revert all the was-this-helpful votes and reset the ratio. Even if your model doesn’t permit edits to a review, for abuse mitigation purposes, this process still needs to be reversible.

  3. Next to a simple point accumulation model, our reviewer quality User Karma process is probably the simplest karma model possible: it tracks the ratio of total was-this-helpful votes for all the reviews that a user has written to the total number of votes received. We’ve labeled this a custom ratio because we assume that the application will be programmed to include certain features in the calculation, such as requiring a minimum number of votes before considering any display of karma to a user (see the sketch after this list). Likewise, it is typical to create a nonlinear scale when grouping users into karma display formats, such as badges like top 100 reviewer. See the next section and Chapter 7 for more on display patterns for karma.

    Karma models, especially public karma models, are subject to massive abuse by users interested in personal status or commercial gain. For that reason, this process must be reversible.
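
A sketch of that custom ratio might look like the following; the minimum-vote gate and the badge tiers are invented to show where application-specific rules hook in:

    def reviewer_karma(helpful_votes, total_votes, min_votes=10):
        """Ratio of helpful votes across all a user's reviews; None = hidden."""
        if total_votes < min_votes:
            return None   # not enough data to display karma yet
        return helpful_votes / total_votes

    def karma_badge(karma):
        """A hypothetical nonlinear mapping for display."""
        if karma is None:
            return ""
        if karma >= 0.95:
            return "Top Reviewer"
        if karma >= 0.75:
            return "Helpful Reviewer"
        return ""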

Now that the model provides a community-generated quality-karma claim for each user (at least each user who has written a review noteworthy enough to attract helpful votes), you may notice that the model doesn’t use that score as an input or weight in calculating other scores. This configuration is a reminder that reputation models all exist within an application context, and therefore the most appropriate use for this score will be determined by your application’s needs.

Perhaps you will keep the quality-karma score as a corporate (internal) reputation, helping to determine which users should get escalating customer support. Perhaps the score will be public, displayed next to every one of a user’s reviews as a status symbol for all to see. It might even be personal, shared only with each reviewer, so that reviewers can see what the overall community thinks of their contributions. Each of these choices has different ramifications, which we discuss in Chapter 6 in detail.

eBay Seller Feedback Karma

eBay contains the Internet’s best-known and most-studied user reputation or karma system: seller feedback. Its reputation model, like most others that are several years old, is complex and continuously adapting to new business goals, changing regulations, improved understanding of customer needs, and the never-ending need to combat reputation manipulation through abuse. See Appendix B for a brief survey of relevant research papers about this system and Chapter 9 for further discussion of the continuous evolution of reputation systems in general.

Rather than detail the entire feedback karma model here, we focus on claims that are from the buyer and about the seller. An important note about eBay feedback is that buyer claims exist in a specific context: a market transaction, which is a successful bid at auction for an item listed by a seller. This specificity leads to a generally higher-quality karma score for sellers than they would get if anyone could just walk up and rate a seller without even demonstrating that they’d ever done business with them; see Implicit: Walk the Walk.

Tip

The reputation model in Figure 4-8 was derived from the following eBay pages: http://pages.ebay.com/help/feedback/scores-reputation.html and http://pages.ebay.com/sellerinformation/PowerSeller/requirements.html, both current as of March 2010.

We have simplified the model for illustration, specifically by omitting the processing for the requirement that only buyer feedback and detailed seller ratings (DSRs) provided over the previous 12 months are considered when calculating the positive feedback ratio, DSR community averages, and—by extension—power seller status. Also, eBay reports user feedback counters for the last month and quarter, which we are omitting here for the sake of clarity. Abuse mitigation features, which are not publicly available, are also excluded.

Figure 4-8. This simplified diagram shows how buyers influence a seller’s karma scores on eBay. Though the specifics are unique to eBay, the pattern is common to many karma systems.

Figure 4-8 illustrates the seller feedback karma reputation model, which is made of typical model components: two compound buyer input claims—seller feedback and detailed seller ratings—and several roll-ups of the seller’s karma, including community feedback ratings (a counter), feedback level (a named level), positive feedback percentage (a ratio), and the power seller rating (a label).

The context for the buyer’s claims is a transaction identifier—the buyer may not leave any feedback before successfully placing a winning bid on an item listed by the seller in the auction market. Presumably, the feedback primarily describes the quality and delivery of the goods purchased. A buyer may provide two different sets of complex claims, and the limits on each vary:

  1. Typically, when a buyer wins an auction, the delivery phase of the transaction starts and the seller is motivated to deliver the goods of the quality advertised in a timely manner. After either a timer expires or the goods have been delivered, the buyer is encouraged to leave feedback on the seller, a compound claim in the form of a three-level rating—positive, neutral, or negative—and a short text-only comment about the seller and/or transaction. The ratings make up the main component of seller feedback karma.

  2. In each week in which a buyer completes a transaction with a seller, the buyer may leave one set of detailed seller ratings—a compound claim of four separate 5-star ratings in these categories: item as described, communications, shipping time, and shipping and handling charges. The only use of these ratings, other than aggregation for community averages, is to qualify the seller as a power seller.

eBay displays an extensive set of karma scores for sellers: the amount of time the seller has been a member of eBay, color-coded stars, percentages that indicate positive feedback, more than a dozen statistics that track past transactions, and lists of testimonial comments from past buyers or sellers. This is just a partial list of the seller reputations that eBay puts on display.

The full list of displayed reputations almost serves as a menu of reputation types present in the model. Every process box represents a claim displayed as a public reputation to everyone, so to provide a complete picture of eBay seller reputation, we simply detail each output claim separately.

  1. The Feedback Score counts every positive rating given by a buyer as part of seller feedback, a compound claim associated with a single transaction. This number is cumulative for the lifetime of the account, and it generally loses its value over time; buyers tend to notice it only if it has a low value.

    It is fairly common for a buyer to change this score, within some time limitations, so this effect must be reversible. Sellers spend a lot of time and effort working to change negative and neutral ratings to positive ratings to gain or to avoid losing a Power Seller Rating.

    When this score changes, it is used to calculate the feedback level.

  2. The Feedback Level process generates a graphical representation (in colored stars) of the feedback score. This is usually a simple data transformation and normalization process; here we’ve represented it as a mapping table, illustrating only a small subset of the mappings.

Caution

This visual system of stars on eBay relies, in part, on the assumption that users will know that a red shooting star is a better rating than a purple star. But we have our doubts about the utility of this representation for buyers. Iconic scores such as these often mean more to their owners, and they might represent only a slight incentive for increasing activity in an environment in which each successful interaction equals cash in your pocket.

  3. The Community Feedback Ratings process generates a compound claim containing the historical counts for each of the three possible seller feedback ratings—positive, neutral, and negative—over the last 12 months, so that the totals can be presented in a table showing the results for the last month, 6 months, and year. Older ratings are decayed continuously, though eBay does not disclose how often this data is updated if new ratings don’t arrive. One possibility would be to update the data whenever the seller posts a new item for sale.

    The positive and negative ratings are used to calculate the positive feedback percentage.

  4. The Positive Feedback Percentage process divides the positive feedback ratings by the sum of the positive and negative feedback ratings over the last 12 months. Note that the neutral ratings are not included in the calculation.

    Tip

    This is a recent change reflecting eBay’s confidence in the success of updates deployed in the summer of 2008 to prevent bad sellers from using retaliatory ratings against buyers who are unhappy with a transaction (known as tit-for-tat negatives). Initially this calculation included neutral ratings because eBay feared that negative feedback would be transformed into neutral ratings. It was not.

    This score is an input into the highly coveted power seller rating. This means that each and every individual positive and negative rating given on eBay is critical—it can make the difference between a seller acquiring power seller status and not.

  5. The Detailed Seller Ratings (DSR) Community Averages are simple reversible averages for each of the four ratings categories: item as described, communications, shipping time, and shipping and handling charges. There is a limit on how often a buyer may contribute DSRs.

    eBay only recently added these categories as a new reputation model because including them as factors in the overall seller feedback ratings diluted the overall quality of seller and buyer feedback. Sellers could end up in disproportionate trouble just because of a bad shipping company or a delivery that took a long time to reach a remote location. Likewise, buyers were bidding low prices only to end up feeling gouged by shipping and handling charges.

    Fine-grained feedback allows one-off small problems to be averaged out across the DSR community averages instead of being translated into red-star negative scores that poison overall trust. Fine-grained feedback for sellers is also actionable by them and motivates them to improve, since these DSR scores make up half of the power seller rating.

  6. The Power Seller Rating, appearing next to the seller’s ID, is a prestigious label that signals the highest level of trust. It includes several factors external to this model, but two critical components are the positive feedback percentage, which must be at least 98%, and the DSR community averages, which each must be at least 4.5 stars (around 90% positive); see the sketch after this list. Interestingly, the DSR scores are more flexible than the feedback average, which tilts the rating toward overall evaluation of the transaction rather than the related details.
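
The two roll-ups at the heart of this list reduce to a few lines of arithmetic. The 98% and 4.5-star thresholds come straight from the description above; the function names, and the omission of eBay’s other power seller requirements, are our simplifications:

    def positive_feedback_pct(positive, negative):
        """Positive feedback percentage; neutral ratings are excluded."""
        rated = positive + negative
        return 100.0 * positive / rated if rated else 100.0

    def power_seller_eligible(positive, negative, dsr_averages):
        """dsr_averages: the four DSR community averages, 1.0-5.0 stars each."""
        return (positive_feedback_pct(positive, negative) >= 98.0
                and all(avg >= 4.5 for avg in dsr_averages))

    positive_feedback_pct(980, 15)                        # -> about 98.5
    power_seller_eligible(980, 15, [4.9, 4.8, 4.6, 4.5])  # -> True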

Though the context for the buyer’s claims is a single transaction or history of transactions, the context for the aggregate reputations that are generated is trust in the eBay marketplace itself. If the buyers can’t trust the sellers to deliver against their promises, eBay cannot do business. When considering the roll-ups, we transform the single-transaction claims into trust in the seller, and—by extension—that same trust rolls up into eBay. This chain of trust is so integral and critical to eBay’s continued success that the company continuously updates the marketplace’s interface and reputation systems.

Flickr Interestingness Scores for Content Quality

The popular online photo service Flickr uses reputation to qualify new user submissions and track user behavior that violates Flickr’s terms of service. Most notably, Flickr uses a completely custom reputation model called interestingness for identifying the highest-quality photographs submitted from the millions uploaded every week. Flickr uses that reputation score to rank photos by user and, in searches, by tag.

Interestingness is also the key to Flickr’s Explore page, which displays a daily calendar of the photos with the highest interestingness ratings, and users may use a graphical calendar to look back at the worthy photographs from any previous day. It’s like a daily leaderboard for newly uploaded content.

Tip

The version of Flickr interestingness that we are presenting here is an abstraction based on several different pieces of evidence: the U.S. patent application (Number 2006/0242139 A1) filed by Flickr, comments that Flickr staff have made on their own message boards, observations by power users in the community, and our own experience in building such reputation systems.

We offer two pieces of advice for anyone building similar systems: there is no substitute for gathering historical data when you are deciding how to clip and weight your calculations, and—even if you get your initial settings correct—you will need to adjust them over time to adapt to the use patterns that will emerge as the direct result of implementing reputation. (See the section Emergent effects and emergent defects.)

Figure 4-9. Interestingness ratings are used in several places on the Flickr site, but most noticeably on the Explore page, a daily calendar of photos selected using this content reputation model.

The model in Figure 4-9 has two primary outputs—photo interestingness and interesting-photographer karma—and everything else feeds into those two key claims.

Of special note in this model is the existence of a karma loop (represented in the figure by a dashed pipe). A user’s reputation score influences how much weight her opinion carries when evaluating others’ work (commenting on it, favoriting it, or adding it to groups): photographers with higher interestingness karma on Flickr have a greater voice in determining what constitutes interesting on the site.

Each day, Flickr generates and stores a list of the top 500 most interesting photos for the Explore page. It also updates the current interestingness score of each and every photo each time one of the input events occurs. Here, we illustrate a real-time model for that update, though it isn’t at all clear that Flickr actually does these calculations in real time, and there are several good reasons to consider delaying that action. See Keep Your Barn Door Closed (but Expect Peeking).

Since there are four main paths through the model, we’ve grouped all the inputs by the kind of reputation feedback they represent: viewer activities, tagging, flagging, and republishing. Each path provides a different kind of input into the final reputations.

  1. Viewer activities represent the actions that a viewing user performs on a photo. Each action is considered a significant endorsement of the photo’s content because any action requires special effort by the user. We have assumed that all actions carry equal weight, but that is not a requirement of the model:

    • A viewer can attach a note to the photo by adding a rectangle over a region of the photo and typing a short note.

    • When a viewer comments on a photo, that comment is displayed for all other viewers to see. The first comment is usually the most important, because it encourages other viewers to join the conversation. We don’t know whether Flickr weighs the first comment more heavily than subsequent ones. (Though that is certainly common practice in some reputation models.)

    • By clicking the Add to Favorites icon, a viewer not only endorses a photo but shares that endorsement—the photo now appears in the viewer’s profile, on her My Favorites page.

    • If a viewer downloads the photo (depending on a photo’s privacy settings, image downloads are available in various sizes), that is also counted as a viewer activity. (Again, we don’t know for sure, but it would be smart on Flickr’s part to count multiple repeat downloads as only one action, lest they risk creating a back door to attention-gaming shenanigans.)

    • Finally, the viewer can click Send to Friend, creating an email with a link to the photo. If the viewer addresses the message to multiple users or even a list, this action could be considered republishing. However, applications generally can’t distinguish a list address from an individual person’s address, so for reputation purposes, we assume that the addressee is always an individual.

  2. Tagging is the action of adding short text strings describing the photo for categorization. Flickr tags are similar to pregenerated categories, but they exist in a folksonomy: whatever tags users apply to a photo, that’s what the photo is about. Common tags include 2009, me, Randy, Bryce, Fluffy, and cameraphone, along with the expected descriptive categories of wedding, dog, tree, landscape, purple, tall, and irony—which sometimes means made of iron!

    Tagging gets special treatment in a reputation model because users must apply extra effort to tag an object, and determining whether one tag is more likely to be accurate than another requires complicated computation. Likewise, certain tags, though popular, should not be considered for reputation purposes at all. Tags have their own quantitative contribution to interestingness, but they also are considered viewer activities, so the input is split into both paths.

  3. Sadly, many popular photographs turn out to be pornographic or in violation of Flickr’s terms of service.

    Tip

    On many sites—if left untended—porn tends to quickly generate a high quality-reputation score. Remember, quality as we’re discussing it is, to some degree, a measure of attention. Nothing garners attention like appealing to prurient interests.

    The smart reputation designer can, in fact, leverage this unfortunate truth. Build a corporate-user porn probability reputation into your system—one that identifies content with a high (or too-high) velocity of attention and puts it in a prioritized queue for human agents to review.

    Flagging is the process by which users mark content as inappropriate for the service. This is a negative reputation vote: by tagging a photo as abusive, the user is saying this doesn’t belong here. This strong action should decrease the interestingness score fast—faster, in fact, than the other inputs can raise it.

  4. Republishing actions represent a user’s decision to increase the audience for a photo by either adding it to a Flickr group or embedding it in a web page. Users can accomplish either by using the blog publishing tools in Flickr’s interface or by copying and pasting an HTML snippet that the application provides. Flickr’s patent doesn’t specifically say that these two actions are treated similarly, but it seems reasonable to do so.

Generally, four things determine a Flickr photo’s interestingness (represented by the four parallel paths in Figure 4-9): the viewer activity score, which represents the effect of viewers taking a specific action on a photo; tag relatedness, which represents a tag’s similarity to others associated with other tagged photos; the negative feedback adjustment, which reflects reasons to downgrade or disqualify the photo; and group weighting, which has an early positive effect on reputation with the first few events.

  1. The events coming into the Karma Weighting process are assumed to have a normalized value of 0.5, because the process is likely to increase it. The process reads the interesting-photographer karma of the user taking the action (not the person who owns the photo) and increases the viewer activity value by some weighting amount before passing it on to the next process. As a simple example, we’ll suggest that the increase in value will be a maximum of 0.25—with no effect for a viewer with no karma and 0.25 for a hypothetical awesome user whose every photo is beloved by one and all. The resulting score will be in the range 0.5 to 0.75. We assume that this interim value is not stored in a reputation statement for performance reasons.

  2. Next, the Relationship Weighting process takes the input score (in the range of 0.5 to 0.75) and determines the relationship strength of the viewer to the photographer. The patent indicates that a stronger relationship should grant a higher weight to any viewer activity. Again, for our simple example, we’ll add up to 0.25 for a mutual first-degree relationship between the users. Lower values can be added for one-way (follower) relationships or even relationships as members of the same Flickr groups. The result is now in the range of 0.5 to 1.0 and is ready to be added into the historical contributions for this photo.

  3. The Viewer Activity Score is a simple accumulator and custom denormalizer that sums up all the normalized event scores that have been weighted. In our example, they arrive in the range of 0.5 to 1.0. It seems likely that this score is the primary basis for interestingness. The patent indicates that each sum is marked with a timestamp to track changes in viewer activity score over time.

    The sum is then denormalized against the available range, from 0.5 to the maximum known viewer activity score, to produce an output from 0.0 to 1.0, which represents the normalized accumulated score stored in the reputation system so that it can be used to recalculate photo interestingness as needed.

  4. Unlike most of the reputation messages we’ve considered so far, the incoming message to the tagging process path does not include any numeric value at all; it contains only the text tag that the viewer is adding to the photo. The tag is first subjected to the Tag Blacklist process, a simple evaluator that checks the tag against a list of forbidden words. If the flow is terminated for this event, there is no contribution to photo interestingness for this tag.

    Tip

    Separately, it seems likely that Flickr would want a tag on the list of forbidden words to have a negative, penalizing effect on the karma score for the person who added it.

    Otherwise, the tag is considered worthy of further reputation consideration and is sent on to the Tag Relatedness process. Only if the tag was on the list of forbidden words is it likely that any record of this process would be saved for future reference.

  5. The nonblacklisted tag then undergoes the Tag Relatedness process, which is a custom computation of reputation based on cluster analysis described in the patent in this way (from Flickr’s U.S. Patent Application No. 2006/0242139 A1):

    [0032] As part of the relatedness computation, the statistics engine may employ a statistical clustering analysis known in the art to determine the statistical proximity between metadata (e.g., tags), and to group the metadata and associated media objects according to corresponding cluster. For example, out of 10,000 images tagged with the word Vancouver, one statistical cluster within a threshold proximity level may include images also tagged with Canada and British Columbia. Another statistical cluster within the threshold proximity may instead be tagged with Washington and space needle along with Vancouver. Clustering analysis allows the statistics engine to associate Vancouver with both the Vancouver-Canada cluster and the Vancouver-Washington cluster. The media server may provide for display to the user the two sets of related tags to indicate they belong to different clusters corresponding to different subject matter areas, for example.

    This is a good example of a black-box process that may be calculated outside of the formal reputation system. Such processes are often housed on optimized machines or run continuously on data samples in order to give best-effort results in real time.

    For our model, we assume that the output will be a normalized score from 0.0 (no confidence) to 1.0 (high confidence) representing how likely the tag is related to the content. The simple average of all the scores for the tags on this photo is stored in the reputation system so that it can be used to recalculate photo interestingness as needed.

  6. The Negative Feedback path determines the effects of flagging a photo as abusive content. Flickr documentation is nearly nonexistent on this topic (for good reason; see Keep Your Barn Door Closed (but Expect Peeking)), but it seems reasonable to assume that even a small number of negative feedback events should be enough to nullify most, if not all, of a photo’s interestingness score.

    For illustration, let’s say that it would take only five abuse reports to do the most damage possible to a photo’s reputation. Using this math, each abuse report event would be worth 0.2. Negative feedback can be thought of as a Reversible Accumulator with a maximum value of 1.0.

Tip

This model doesn’t account for abuse by users ganging up on a photo and flagging it as abusive when it is not. (See Who watches the watchers?). That is a different reputation model, which we illustrate in detail in Chapter 10.

  7. The last component of the process is the republishing path. When a photo gets even more exposure by being shared on channels such as blogs and Flickr groups, Flickr assigns some additional reputation value to it, shown here as the Group Weighting process.

    Flickr official forum posts indicate that for the first five or so actions, this value quickly increases to its maximum value—1.0 in our system. After that, it stabilizes, so this process is also a simple accumulator, adding 0.2 for every event and capping at 1.0.

  8. All of the inputs to Photo Interestingness, a simple mixer, are normalized scores from 0.0 to 1.0 and represent either positive (viewer activity score, tag relatedness, group weighting) or negative (negative feedback) effects on the claim.

    The exact formulation for this calculation is not detailed in any documentation, nor is it clear that anyone who doesn’t work for Flickr understands all its subtleties. But…for illustration purposes, we propose this drastically simplified formulation (sketched in code after this list): photo interestingness is made up of 20% each of group weighting and tag relatedness plus 60% viewer activity score minus negative feedback.

    A common early modification to a formulation like this is to increase the positive percentages enough so that no minor component is required for a high score. For example, you could increase the 60% viewer activity score to 80% and then cap the result at 1.0 before applying any negative effects.

    A copy of this claim value is stored in the same high-performance database as the rest of the search-related metadata for the target photo.

  9. The Interesting Photographer Karma score is recalculated each time the interestingness reputation of one of the photos changes. This liquidity-compensated average is sufficient when using this karma to evaluate other users’ photos.
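
Pulling the example numbers from this walkthrough together, the whole pipeline can be sketched as follows. Every constant here—the 0.5 base event value, the two 0.25 weighting bonuses, the 0.2-per-event accumulators, and the 20/20/60 mix—is our illustrative reconstruction, not Flickr’s actual formulation:

    def weighted_event(viewer_karma, relationship_strength):
        """Steps 1-2: weight one viewer activity by karma and relationship.

        Both inputs are normalized 0.0-1.0; the result lands in 0.5-1.0.
        """
        return 0.5 + 0.25 * viewer_karma + 0.25 * relationship_strength

    def negative_feedback(abuse_reports):
        """Step 6: five abuse reports do the maximum possible damage."""
        return min(1.0, 0.2 * abuse_reports)

    def photo_interestingness(viewer_activity, tag_relatedness,
                              group_weighting, abuse_reports):
        """Step 8's simplified mixer: 60/20/20 positive, minus negative."""
        score = (0.6 * viewer_activity
                 + 0.2 * tag_relatedness
                 + 0.2 * group_weighting) - negative_feedback(abuse_reports)
        return max(0.0, min(1.0, score))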

The Flickr model is undoubtedly complex and has spurred a lot of discussion and mythology in the photographer community on Flickr.

It’s important to reinforce the point that all of this computational work is in support of three very exact contexts: interestingness works specifically to influence photos’ search rank on the site, their display order on user profiles, and ultimately whether or not they’re featured on the site-wide Explore page. It’s the third context, Explore, that introduces one more important reputation mechanic: randomization.

Each day’s photo interestingness calculations produce a ranked list of photos. If the content of the Explore page were 100% determined by those calculations, it could get boring. First-mover effects predict that you would probably always see the same photos by the same photographers at the top of the list (see the section First-mover effects). Flickr lessens this effect by including a random factor in the selection of the photos.

Each day, the top 500 photos appear in randomized order. In theory, the photo with the 500th-ranked photo interestingness score could be displayed first and the one with the highest photo interestingness score could be displayed last. The next day, if they’re still on the top-500 list, they could both appear somewhere in the middle.
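
The mechanism itself is easy to sketch—rank, clip to 500, then shuffle before display (assuming each photo object carries its interestingness score):

    import random

    def explore_page(photos, n=500):
        """Select the day's top-n photos, then hide the ranking."""
        top = sorted(photos, key=lambda p: p.interestingness, reverse=True)[:n]
        random.shuffle(top)   # display order no longer reveals relative scores
        return top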

This system has two wonderful effects:

  • A more diverse set of high-quality photos and photographers gets featured, encouraging more participation by the users producing the best content.

  • It mitigates abuse, because the photo interestingness score is not displayed and the randomness of the display prevents it from being deduced. Randomness makes it nearly impossible to reverse-engineer the specifics of the reputation model—there is simply too much noise in the system to be certain of the effects of smaller contributions to the score.

What’s truly wonderful is that this randomness doesn’t harm Explore’s efficacy in the least; given the scale and activity of the Flickr community, each and every day there are more than enough high-quality photos to fill a 500-photo list. Jumbling up the order for display doesn’t detract from the experience of browsing them by one whit.

When and Why Simple Models Fail

As a business owner on today’s Web, probably the greatest thing about social media is that the users themselves create the media from which you, the site operator, capture value. This means, however, that the quality of your site is directly related to the quality of the content created by your users.

This can present problems. Sure, the content is cheap, but you usually get what you pay for, and you will probably need to pay more to improve the quality. Additionally, some users have a different set of motivations than you might prefer.

We offer design advice to mitigate potential problems with social collaboration, along with suggestions for specific nontechnical solutions.

Party Crashers

As illustrated in the real-life models earlier, reputation can be a successful motivation for users to contribute large volumes of content and/or high-quality content to your application. At the very least, reputation can provide critical money-saving value to your customer care department by allowing users to prioritize the bad content for attention and, likewise, to flag power users and content worth featuring.

But mechanical reputation systems, of necessity, are always subject to unwanted or unanticipated manipulation; they are only algorithms, after all. They cannot account for the many, sometimes conflicting, motivations for users’ behavior on a site. One of the strongest motivations of users who invade reputation systems is commercial. Spam invaded email. Marketing firms invade movie review and social media sites. And drop-shippers are omnipresent on eBay.

eBay drop-shippers put the middleman back into the online market; they are people who resell items that they don’t even own. It works roughly like this:

  1. A seller develops a good reputation, gaining a seller feedback karma of at least 25 for selling items that she personally owns.

  2. The seller buys some drop-shipping software, which helps locate items for sale on eBay and elsewhere cheaply, or joins an online drop-shipping service that has the software and presents the items in a web interface.

  3. The seller finds cheap items to sell and lists them on eBay for a higher price than they’re available from the drop-shipper but lower than other eBay sellers are selling them for. The seller includes an average or above-average shipping and handling charge.

  4. The seller sells an item to a buyer, receives payment, and sends an order for the item, along with a drop-shipping payment, to the drop-shipper, who then delivers the item to the buyer.

This model of doing business was not anticipated by the eBay seller feedback karma model, which only includes buyers and sellers as reputation entities. Drop-shippers are a third party in what was assumed to be a two-party transaction, and they cause the reputation model to break in various ways:

  • The drop-shippers sometimes fail to deliver the goods as promised to the buyer. The buyer then gets mad and leaves negative feedback: the dreaded red star. That would be fine, but it is the seller—who never saw or handled the goods—that receives the mark of shame, not the actual shipping party.

  • This arrangement is a big problem for the seller, who cannot afford the negative feedback if she plans to continue selling on eBay.

  • The typical options for rectifying a bungled transaction won’t work in a drop-shipper transaction: it is useless for the buyer to return the defective goods to the seller. (They never originated from the seller anyway.) Trying to unwind the shipment (the buyer returns the item to the seller; the seller returns it to the drop-shipper, if that is even possible; the drop-shipper buys or waits for a replacement item and finally ships it) would take too long for the buyer, who expects immediate recompense.

In effect, the seller can’t make the order right with the customer without refunding the purchase price in a timely manner. This puts her out of pocket for the price of the goods, along with the hassle of trying to recover the money from the drop-shipper.

But a simple refund alone sometimes isn’t enough for the buyer! Depending on the amount of perceived hassle and effort this transaction has cost the buyer, he is still likely to rate the transaction negatively overall. (And rightfully so. Once it’s become evident that a seller is working through a drop-shipper, many of their excuses and delays start to ring very hollow.) So a seller may have, at this point, outlayed a lot of her own time and money to rectify a bad transaction only to still suffer the penalties of a red star.

What option does the seller have left to maintain her positive reputation? You guessed it—a payoff. Not only will a concerned seller eat the price of the goods—and any shipping involved—but she will also pay an additional cash bounty (typically up to $20.00) to get buyers to flip a red star to green.

What is the cost of clearing negative feedback on drop-shipped goods? The cost of the item + $20.00 + lost time negotiating with the buyer. That’s the cost that reputation imposes on drop-shipping on eBay.

The lesson here is that a reputation model will be reinterpreted by users as they find new ways to use your site. Site operators need to keep a wary eye on the specific behavior patterns they see emerging and adapt accordingly. Chapter 9 provides more detail and specific recommendations for prospective reputation modelers.

Keep Your Barn Door Closed (but Expect Peeking)

You will—at some point—be faced with a decision about how open (or not) to be about the details of your reputation system. Exactly how much of your model’s inner workings should you reveal to the community? Users inevitably will want to know how reputations are earned and calculated.

This decision is not at all trivial: if you err on the side of extreme secrecy, you risk damaging your community’s trust in the system that you’ve provided. Your users may come to question its fairness or—if the inner workings remain too opaque—they may flat-out doubt the system’s accuracy.

Most reputation-intensive sites today attempt to satisfy at least some of the community's curiosity about how content and user reputations are earned; you can't keep your system a complete secret anyway.

Equally bad, however, is divulging too much detail about your reputation system to the community. This is probably the more common mistake among site designers, especially in the early stages of deploying the system and growing the community. As an example, consider the highly specific breakdown of actions on the Yahoo! Answers site, and the points awarded for each (see Figure 4-10).

Figure 4-10. How to succeed at Yahoo! Answers? The site courteously provides you with a scorecard.

Why might this breakdown be a mistake? For a number of reasons. Assigning overt point values to specific actions goes beyond enhancing the user experience and starts to directly influence it. Arguably, it may tip right over into dictating user behavior, which is generally frowned upon.

A detailed breakdown also arms the malcontents in your community with exactly the information they need to deconstruct your model. They won't even need to guess at the relative weightings of inputs into the system; the value of each input is right there on the site, writ large. Try, instead, to use language that is clear and truthful without being exhaustively complete, like this example from the Yahoo! UK Message Boards:

The exact formula that determines medal-achievement will not be made public (and is subject to change) but, in general, it may be influenced by the following factors: community response to your messages (how highly others rate your messages); the amount of (quality) contributions that you make to the boards; and how often and accurately you rate others’ messages.

Staying vague does not mean, of course, that some in your community won’t continue to wonder, speculate, and talk among themselves about the specifics of your reputation system. Algorithm gossip has become something of a minor sport on collaborative sites like Digg and YouTube.

For some participants, guessing at the workings of reputations like highest rated or most popular is probably just that—an entertaining game and nothing more. Others, however, see only the benefit of any insight they might be able to gain into the system’s inner workings: greater visibility for themselves and their content, more influence within the community, and the greater currency that follows both. (See Egocentric incentives.)

The following are some helpful strategies for masking the inner workings of your reputation models and algorithms.

Decay and delay

Time is on your side. Or it can be, in one of a couple of ways. First, consider the use of time-based decay in your models: recent actions count for more than actions in the distant past, and the effects of older actions decay (lessen) over time. Incorporating time-based decay has several benefits (a sketch of one possible implementation follows the list):

  • Reputation leaders can’t rest on their laurels. When reputations decay, they have to be earned back continually. This requirement encourages your community to stay active and engage with your site frequently.

  • Decay is an effective counter to the stagnation that naturally results from network effects (see First-mover effects). Older, more established participants will not tend to linger at the top of rankings quite as much.

  • Those who do probe the system to gain an unfair advantage will not reap long-term benefits from doing so unless they continue to do it within the constraints imposed by the decay. (Coincidentally, this profile of behavior makes it easier to spot—and correct for—users who are gaming the system.)
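
To make the idea concrete, here's a minimal sketch of one way such decay might be computed, using a configurable half-life. The names, the event format, and the 30-day half-life are illustrative assumptions, not a prescription:

    import time

    # Illustrative time-based decay: each contributing action is
    # discounted by its age, halving in weight every HALF_LIFE_DAYS.
    HALF_LIFE_DAYS = 30.0

    def decayed_score(events, now=None):
        """Sum event values, weighting each by how recently it occurred."""
        now = now if now is not None else time.time()
        total = 0.0
        for event in events:
            age_days = (now - event["timestamp"]) / 86400.0
            weight = 0.5 ** (age_days / HALF_LIFE_DAYS)
            total += event["value"] * weight
        return total

    # A fresh favorite counts fully; a 60-day-old one counts for a quarter.
    events = [
        {"value": 1.0, "timestamp": time.time()},
        {"value": 1.0, "timestamp": time.time() - 60 * 86400},
    ]
    print(decayed_score(events))  # ~1.25

Recomputing scores from the event history this way, rather than storing a fixed running total, is what lets older contributions fade without any separate correction job.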

It’s also beneficial to delay the results of newly triggered inputs. If a reasonable window of time exists between the triggering of an input (marking a photo as a favorite, for instance) and the resulting effect on that object’s reputation (moving the photo higher in a visible ranking), it confounds a gaming user’s ability to do easy what-if comparisons (particularly if the period of delay is itself unpredictable).

When the reputation effects of various actions are instantaneous, you’ve given the gamers of your system a powerful analytic tool for reverse-engineering your models.
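
As an illustration, an unpredictable delay might be implemented by routing each input through a deferred job queue. The queue interface and the delay bounds below are invented for this sketch:

    import random

    # Illustrative bounds; the unpredictability matters more than the values.
    MIN_DELAY_SECONDS = 15 * 60      # at least 15 minutes
    MAX_DELAY_SECONDS = 6 * 60 * 60  # at most 6 hours

    def schedule_reputation_effect(queue, input_event):
        """Apply an input's effect only after a randomized delay, so a
        gamer can't pair a single action with its visible result."""
        delay = random.uniform(MIN_DELAY_SECONDS, MAX_DELAY_SECONDS)
        # `queue` is a hypothetical deferred-job queue; any scheduler
        # that can run work after a given interval would do.
        queue.enqueue(input_event, run_after_seconds=delay)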

Provide a moving target

We’ve already cautioned that it’s important to keep your system flexible (see Plan for Change). That’s not just good advice from a technical standpoint, but from a social and strategic one as well. Put simply: leave yourself enough wiggle room to adjust the impact of different inputs in the system (add new inputs, change their relative weightings, or eliminate ones that were previously considered). That flexibility gives you an effective tool for confounding gaming of the system. If you suspect that a particular input is being exploited, you at least have the option of tweaking the model to compensate for the abuse. You will also want the flexibility of introducing new types of reputations to your site (or retiring ones that are no longer serving a purpose).

It is tricky, however, to enact changes like these without affecting the social contract you’ve established with the community. Once you’ve codified a certain set of desired behaviors on your site, some users will (understandably) be upset if the rug gets pulled out from under them. This risk is yet another argument for avoiding disclosure of too many details about the mechanics of the system, or for downplaying the system’s importance.

Reputation from Theory to Practice

Parts I and II of this book focused on reputation theory:

  • Understanding reputation systems through defining the key concepts

  • Defining a visual grammar for reputation systems

  • Creating a set of key building blocks and using them to describe simple reputation models

  • Using it all to illuminate popular complex reputation systems found in the wild

Along the way, we sprinkled in practitioner's tips to share what we've learned from existing reputation systems, helping you understand what could go wrong, and what already has.

Now you’re prepared for the second section of the book: applying this theory to a specific application—yours. Chapter 5 starts the project off with three basic questions about your application design. In haste, many projects skip over one or more of these critical considerations, and the results are often very costly.
