
Like most companies, Safari’s interest in analytics data has only grown with time. Knowing how our users interact with our platform helps us to make a better product, and the more detailed questions we can ask about our users, the more successfully we can tailor our efforts to improve their experience. Commercial web site statistics platforms provide a relatively easy way to record and summarize analytics data and are enormously valuable in allowing companies to begin to understand their users. At Safari, for example, we’ve gotten a lot of benefit from using Google Analytics over the years, but at a certain point we began to want to answer questions that were very hard to “ask” of it.

Often this was because the data we needed was very granular and Google Analytics tends to want to answer questions about data in aggregate. On other occasions we needed to be able to stitch together data from our analytics software with other datasets that were internal to the company, which required reams of code to be written to fetch data from the Google Analytics API and reams more to perform entity resolution across those datasets. For commercial analytics platforms the trade-off you make for ease-of-implementation is that your data is never truly your own.

Over time we began to notice that more and more questions were being stymied by our analytics software, so we decided to start gathering analytics data on our own as a supplement to, rather than a replacement for, Google Analytics. One of our most ambitious (and ongoing) efforts was to create a database of every request to our site. This kind of task is traditionally accomplished in one of two ways: either client-side, by including in every page a JavaScript file that records analytics events, or server-side, by processing the log files the site’s web servers generate. There are trade-offs to both approaches that are outside the scope of this blog post; suffice it to say we chose the latter. This blog post series will discuss the advantages of using Haskell for log parsing, with examples of parsing the different types of data commonly found in log files, and show how to build a fully functional, self-contained log parser in Haskell.

The log file pipeline

At Safari we use Varnish, a reverse proxy that is useful for caching content that is static while passing requests for dynamic content upstream to a web application. Besides the performance and scalability benefits Varnish provides, its logging is highly customizable. Varnish can be configured to log information about the cookies a user has, the referring URL, the length of time our platform took to fulfill the user’s request, and much more. Every single request to a Safari domain comes through Varnish, so it’s a useful point from which to gather data for our analytics database.

Once we had configured Varnish to log all the data we were after, we used rsyslog to aggregate logs from all of our Varnish servers and deliver them to a place where they could be analyzed. Finally, the very large files generated by this setup were managed by logrotate.

Once log files containing all the information we needed were being delivered to a place where they could be processed, my job was to extract structured data from those files and store it in a database. This is where Haskell came in.

Why Haskell?

Because these log files represented all traffic coming to our site, each chunk tended to be many gigabytes in size and consist of tens of millions of rows. Knowing that, I had two goals for the log parser I knew I’d have to write to extract data from these logs:

  1. I wanted to be able to parse them as quickly as possible using as little memory as possible.
  2. I wanted to get away with writing as little code as possible to accomplish this.

We write a lot of Python here at Safari; in fact, it’s our most widely used language. Because it’s a very high-level language, it seemed likely to satisfy my second requirement. But my experience writing log parsers in Python, even when using tricks like lazy evaluation, led me to suspect it would not do so well on the first.

Conversely, something like C or C++ (or even Java) might have allowed me to create a much more efficient parser, but it would have resulted in a considerably larger code base as well. Haskell seemed like a language that would let me do well on both goals. Although we have some engineers at Safari who have worked with Haskell in the past, this would be the first Haskell product put into production at the company, so I decided to prototype a log parser in Haskell and see where things ended up.

The results were quite promising: I was able to pull together a working prototype in just a few days. It was impressively fast compared to the Python-based log parsers I’d built in the past, and it ran in more-or-less constant space; no matter how big a file I gave it, memory usage stayed around 5MB. The code base was also impressively small. This is due partly to the expressiveness of Haskell as a language, and partly to the fact that the language has several libraries that are extremely well suited to log parsing. The performance of the prototype was good enough that I decided to complete the project and put it into production.

In a second blog post I’ll cover some of the more down-to-earth concerns with parsing log files and developing Haskell applications in general. For now, let’s take a look at the theory and mechanics of log parsing with Haskell.

Writing a log parser in Haskell

Let’s say you have a file full of lines that look like this:
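(The IP address and referrer in this sample are illustrative stand-ins; the remaining values are the ones broken out below.)

    192.0.2.1 - - [2015-01-13T14:30:19Z] "GET /register/ HTTP/1.1" 200 1043 "http://example.com/" "Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko" "sessionid=axzx5ksfugbf36gqmeindmdeyqllg4pe"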

You might recognize this as a log line in what’s called NCSA Combined Log Format, a format that’s output by many web servers. If you break it down into its constituent pieces, you get:

  • The IP address the request originated from: 192.0.2.1 (an illustrative address)
  • The rfc931 identifier and user name; since these are unknown, they show up as - and -
  • The date, time, and timezone of the request: 2015-01-13T14:30:19Z
  • The request method: GET
  • The URL requested: /register/
  • The status code of the request: 200
  • The size of the response in bytes: 1043
  • The referer: http://example.com/ (illustrative)
  • The user agent: Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko
  • The value of a cookie called “sessionid”: axzx5ksfugbf36gqmeindmdeyqllg4pe

If you had to write something to parse this information out, you might first choose a regular expression:
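(A sketch, not a production-ready pattern; each parenthesized group captures one of the fields above.)

    ^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^ ]+) ([^ ]+) ([^"]*)" ([0-9]+) ([0-9]+) "([^"]*)" "([^"]*)" "sessionid=([^"]*)"$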

This would parse out the basic units listed above, which would all then need further validation and parsing. But a regular-expression-based approach has a couple of problems. First, it’s not very easy to read. If you spend a lot of time with regular expressions, you can probably figure it out after a bit, but this particular regular expression doesn’t even do all that much. We could try to make it more sophisticated, of course; for example, it doesn’t do anything to validate the IPv4 address. We would need to replace ([^ ]+) with this regex to do that:
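    ((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)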

We could do something similar for the date with this regular expression:
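(This matches ISO-8601 timestamps like the 2015-01-13T14:30:19Z shown above.)

    [0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z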

But as you can see, this is starting to get ugly really fast, and we haven’t even gotten to the hard parts, like parsing user-agent strings. These regular expressions don’t support good error handling either. And if the log file allows different variants of data depending on the situation, we can’t really handle them all without resorting to backtracking, which will make regular-expression performance nose-dive and may, in some pathological cases, completely blow up memory usage. Furthermore, a large, complex regular expression is difficult to test. Wouldn’t it be nice if we could instead break the parsing of a log line down into a number of small, simple, easily tested units and then combine them to make a full parser?

Haskell allows us to do this with a decidedly better approach: parsers and parser combinators. Think of a parser as a function that consumes all or part of a string and returns some structured interpretation of it along with the rest of the unconsumed string. Parser combinators let us combine small, simple parsers into more complex ones by sequencing them in the order of the things they parse in a log file line. Here Haskell comes to the log parsing game with an unfair advantage: it has not one but two industrial-strength, full-featured, and mature parsing libraries: parsec and attoparsec.

Which to choose largely depends on your requirements. Parsec is the slower of the two, but it lets you produce more detailed error messages on parse failures. If you want to parse source code files (which are generally not that big) in a particular language and provide detailed messages about, for example, syntax errors, parsec is a great choice. But if you need to parse very large volumes of data and don’t care as much about error messages, then attoparsec is the way to go.

For parsing large log files, we care very much about performance. If a malformed log line occasionally appears, we can note that fact and continue on; we don’t necessarily need detailed information about why it was malformed. For these reasons we chose attoparsec.

Through the rest of this post, we assume the reader has a basic working knowledge of Haskell. If you’re totally new to Haskell but the subject interests you, there’s an excellent overview on the Haskell website, plus a number of excellent books on the subject in Safari’s catalog.

Simple parser 1: HTTP method

Using attoparsec, let’s write a parser that will correctly handle one small, simple task: parsing the HTTP method in the log file:
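(A minimal sketch, assuming attoparsec’s ByteString-oriented Char8 module.)

    {-# LANGUAGE OverloadedStrings #-}

    import Data.Attoparsec.ByteString.Char8
    import Data.ByteString (ByteString)

    -- Succeeds only on input beginning with "GET".
    parseHTTPMethod :: Parser ByteString
    parseHTTPMethod = string "GET" *> return "Success!"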

Here you see a parser at its simplest. With attoparsec, parsers will always return some structured data (in this case a ByteString) in the Parser monad. [1] This parser consists of two actions that are sequenced with *>:

  1. Consume a sequence of bytes matching the string "GET" and return that string if it matches (string "GET")
  2. Return the string "Success!" to the parser (return "Success!")

If the first action succeeds, the next one will be evaluated. This parser isn’t very interesting but we can verify that it works with the parseOnly function, which will apply our parser to the given input and return the result:
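(A representative GHCi session, with OverloadedStrings enabled.)

    ghci> parseOnly parseHTTPMethod "GET /register/ HTTP/1.1"
    Right "Success!"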

The parseOnly function returns the result of a parse in the Either type. An Either value always holds either a “left” or a “right” value. It’s often used for error handling: a Left value means an error occurred and carries a string explaining what went wrong, while a Right value indicates success and holds the result. That’s how it’s used here. Let’s try to parse some “bad” data:
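    ghci> parseOnly parseHTTPMethod "POST /register/ HTTP/1.1"
    Left "Failed reading: takeWith"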

(The error message “Failed reading: takeWith” is an example of the less-than-helpful results you can get from attoparsec as opposed to parsec.) This parser is, indeed, small and simple. In fact, it’s so simple that it only works with one possible HTTP method: GET. Let’s improve it to work with all the methods enumerated in the HTTP spec:
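(A sketch; RFC 2616 enumerates OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, and CONNECT.)

    import Control.Applicative ((<|>))

    parseHTTPMethod :: Parser ByteString
    parseHTTPMethod = string "GET"
                  <|> string "POST"
                  <|> string "PUT"
                  <|> string "DELETE"
                  <|> string "HEAD"
                  <|> string "OPTIONS"
                  <|> string "TRACE"
                  <|> string "CONNECT"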

In this example, we begin to see the power of parser combinators. We can create a parser for each HTTP method and then simply combine them using the associative binary operator <|>. The parser will return a string matching whichever HTTP method it encounters in the log file. What this code effectively means is:
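Try to match "GET"; if that fails, try "POST"; if that fails, try "PUT"; and so on down the chain, returning whichever alternative succeeds first.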

With nothing more than *> and <|> we’ve built up a more complex parser from a chain of very simple ones. One important thing to note about string is that this particular parser consumes no input if it fails; this is what allows us to chain these parsers together in this way. Some attoparsec parsers consume their input whether they succeed or fail, so it’s important to be aware of the behavior of the parser you’re using. Fortunately, parser behavior is well documented in attoparsec. Let’s make this parser a little more fault-tolerant:
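(Something along these lines.)

    parseHTTPMethod :: Parser ByteString
    parseHTTPMethod = stringCI "GET"
                  <|> stringCI "POST"
                  <|> stringCI "PUT"
                  <|> stringCI "DELETE"
                  <|> stringCI "HEAD"
                  <|> stringCI "OPTIONS"
                  <|> stringCI "TRACE"
                  <|> stringCI "CONNECT"
                  <|> return "Unknown"    -- always succeeds: the fall-back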

Firstly, we’ve replaced string with stringCI, its case-insensitive version. Secondly, at the very end of our chain, we’ve added one final parser that is always guaranteed to succeed, because all it does is return the string “Unknown”. The parser now has a fall-back “default” value for when the HTTP method is not recognized. Conversely, if we want to be strict about the input we allow, we might do this instead: [2]
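(A sketch; the error text here is our own choice.)

    parseHTTPMethod :: Parser ByteString
    parseHTTPMethod = stringCI "GET"
                  <|> stringCI "POST"
                  <|> stringCI "PUT"
                  <|> stringCI "DELETE"
                  <|> stringCI "HEAD"
                  <|> stringCI "OPTIONS"
                  <|> stringCI "TRACE"
                  <|> stringCI "CONNECT"
                  <|> fail "Unrecognized HTTP method"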

This shows how, with some planning, attoparsec-based parsers can emit more useful error messages when needed:
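    ghci> parseOnly parseHTTPMethod "FOO /register/ HTTP/1.1"
    Left "Failed reading: Unrecognized HTTP method"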

Let’s wrap this example up with a final improvement:
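(A sketch, building on the parser above.)

    data HTTPMethod = Get
                    | Post
                    | Put
                    | Delete
                    | Head
                    | Options
                    | Trace
                    | Connect
                    | Unknown
                    deriving (Show, Eq)

    parseHTTPMethod :: Parser HTTPMethod
    parseHTTPMethod = (stringCI "GET"     *> return Get)
                  <|> (stringCI "POST"    *> return Post)
                  <|> (stringCI "PUT"     *> return Put)
                  <|> (stringCI "DELETE"  *> return Delete)
                  <|> (stringCI "HEAD"    *> return Head)
                  <|> (stringCI "OPTIONS" *> return Options)
                  <|> (stringCI "TRACE"   *> return Trace)
                  <|> (stringCI "CONNECT" *> return Connect)
                  <|> return Unknown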

In this example, we first define a new type called HTTPMethod. In type-system theory this is called a “sum type”, because we can enumerate all of its possible values: they are simply the methods allowed by the HTTP spec plus a fall-back called Unknown. A value of type HTTPMethod must be a Get, a Post, a Put, and so on; it has to be one of these and cannot be anything else. In this way we can model domain-specific information in our type system, which is a very powerful feature of Haskell. For example, if we wanted to count up all the GET requests in a log file, we could do this:
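(A sketch of the whole program; the file name access.log is assumed for illustration, and parseHTTPMethod is the parser defined above.)

    import qualified Data.ByteString.Char8 as B
    import Data.Attoparsec.ByteString.Char8 (parseOnly)
    import Data.Either (rights)

    countType :: HTTPMethod -> Int
    countType Get = 1
    countType _   = 0

    countGets :: [HTTPMethod] -> Int
    countGets = sum . map countType

    main :: IO ()
    main = do
      contents <- B.readFile "access.log"    -- an assumed file name
      let results = map (parseOnly parseHTTPMethod) (B.lines contents)
      print (countGets (rights results))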

The main function breaks down into a few steps:

  1. Read the logfile contents. [3]
  2. Split the logfile up into a list of lines within the file.
  3. Apply the parser to our list of lines, then use rights to extract only the Right values from a list of Either types (effectively discarding lines that failed to parse correctly)
  4. Count up the number of Get results.

The countType function uses pattern-matching on its arguments and returns 1 if it is called with a Get type, and 0 for anything else. If we map this function over our results, we should have a list of ones and zeroes which we can then simply sum up to get the number of “GET” requests; this is what the countGets function does.

Next, let’s define a parser for the HTTP status code. We’ll step through this more quickly now that the basics are clear.

Simple parser 2: HTTP status

We expect an HTTP status code to be in the range 200–505, so our parser’s type signature could be something like parseHTTPStatus :: Parser Int. Not every integer in that range is valid, but we don’t need to be too picky here. However, if our parser encounters a value outside that range, we probably need to do something, as it’s obviously not valid. We could call the fail operation in a monadic context and have the parser throw out the whole line, but that might be throwing out the baby with the bathwater. Instead, let’s have the parser return a Maybe Int: Just the status code if the value is in range, and Nothing otherwise:
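(A sketch.)

    parseHTTPStatus :: Parser (Maybe Int)
    parseHTTPStatus = validate <$> decimal
      where
        validate :: Int -> Maybe Int
        validate code
          | code >= 200 && code <= 505 = Just code
          | otherwise                  = Nothing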

Here we encounter another attoparsec parser: decimal. This returns an unsigned decimal number, consuming input until it encounters a non-numeric character. In the function’s where-binding we define a helper, validate, which takes the parsed integer and returns Just the status code if it is valid or Nothing if it is not. We can map this validation function over the value inside the functor returned by decimal with <$>.

Putting it all together

In the next blog post, we will cover parsing more of the NCSA Combined Log Format. In the meantime, now that we can parse out the status code and method, let’s imagine a much simpler log file format with just the information we’ve dealt with so far:
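(An invented sample file for illustration; note the bogus method and out-of-range status in the last two lines.)

    GET 200
    POST 201
    PUT 301
    FNORD 404
    GET 999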

We have an HTTP method followed by a space, followed by the HTTP status code. (The log file also has some values which are clearly incorrect.) How can we combine parseHTTPStatus and parseHTTPMethod to handle this? Very easily, it turns out. We’ve already defined an HTTPMethod algebraic data type; let’s think about our parser in terms of other types it will need to work with:
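(A sketch; the Maybe wrapper carries over from parseHTTPStatus above.)

    type HTTPStatus = Int

    parseLogLine :: Parser (HTTPMethod, Maybe HTTPStatus)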

We now have an alias for Int called HTTPStatus, and we have defined the return type of our parser: a tuple of the method and the status code. Now that we have these types, we can write a parser for the full log file line:
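(A sketch in do-notation.)

    parseLogLine :: Parser (HTTPMethod, Maybe HTTPStatus)
    parseLogLine = do
      method <- parseHTTPMethod
      space                       -- consume the separating space
      status <- parseHTTPStatus
      return (method, status)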

Given a log file line, the parser will parse the method, then a space (using space) then the status code. Combining parsers using do-notation makes for a very easy-to-read approach, but it’s also an imperative one. We could rewrite this in a fully applicative style that accomplishes the same thing:
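(A sketch of the applicative version.)

    parseLogLine :: Parser (HTTPMethod, Maybe HTTPStatus)
    parseLogLine = (,) <$> parseHTTPMethod <* space <*> parseHTTPStatus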

If you’re comfortable with the Control.Applicative standard library, this second approach probably looks nicer; if you’re not, it’s definitely harder to read. Either approach is fine, so choose what you’re comfortable with. [4] We can now parse our simplified log format easily:
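(A sketch, following the same pattern as main above.)

    parseLogFile :: FilePath -> IO [(HTTPMethod, Maybe HTTPStatus)]
    parseLogFile path = do
      contents <- B.readFile path
      return (rights (map (parseOnly parseLogLine) (B.lines contents)))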

This function will read a file, parse it and return a list of all results that succeeded.

We’ve covered some of the theory behind parsers and parser combinators and built a simplified log parser using attoparsec. In the second part of this series, we’ll discuss:

  • parsing more complex data,
  • performance considerations,
  • packaging everything as a proper Haskell project,
  • and building an executable logparser command that we can deploy.


[1] You don’t necessarily need to worry about the Parser monad at this point; its responsibility is feeding input to the parser and handling errors, among other things.

[2] The fail function is part of the Monad typeclass. fail is generally disliked within the Haskell community and should not be used unless you are sure the monad in which you invoke it actually overrides the default fail behavior, which is to print the message to stderr and exit the program. Fortunately, attoparsec treats a call to fail more sensibly: when it is called within Parser, any further attempts to parse the input are stopped and the error message is returned in a Left value. In cases like this, where we are parsing a log file one line at a time, if we encounter a line with a value that seems hopelessly wrong, telling attoparsec to just give up and move on to the next line may be perfectly valid behavior.

[3] You may have noticed that we are reading in log data as a ByteString rather than a standard Haskell String. The reasons for this will be discussed in the second part of this series.

[4] Sequencing parsers applicatively allows the compiler to perform static analysis on a parser without running it. This knowledge can be used to avoid things like backtracking that may slow your parser down. This is not possible when sequencing parsers monadically, because each parser’s grammar can depend on the result of the previous one. In this case, however, the performance difference is probably negligible; don’t hesitate to choose do-notation if you find it easier to read.

