The “Visit” Data Structure

Trying to track individual visitors via the entries in a web server’s access log is something of an exercise in futility. With things like proxy servers and client-side caching getting in the way, the series of accesses that show up in the log from a particular hostname or IP address can give only an approximate picture of what individual visitors are doing. Multiple users sharing the same IP address can have their activity merged into what looks like a single, very active visitor. Conversely, a single visitor can show up in the logs via a different IP address on each request, defying efforts to abstract those requests into a meaningful “visit.” A proxy server at a major ISP can cache the site’s pages, then satisfy hundreds of requests that never get recorded in the server’s logs.

Even so, it’s hard not to wonder what a log file would reveal if we could pluck out the requests corresponding to specific hosts and string them together to see what patterns emerge. Many users still browse from individual host addresses without intervening proxy servers; for these users, at least, the resulting “visit” tracking provides a fascinating look at the paths being followed through the site. It’s also interesting to see how many incoming requests are actually being generated by robot “spider” programs, and to study the behavior of those programs as they interact with the server. Finally, it’s an interesting programming exercise to see how we can assemble and present ...
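To make the idea concrete, here is a minimal sketch of the kind of structure involved: a hash keyed by host, with each host holding a list of visits and each visit holding the sequence of requested paths. This is not the script developed in the chapter; it assumes Common Log Format input and an arbitrary 30-minute idle timeout for deciding where one visit ends and the next begins, and the variable names (%visits, $idle_timeout, and so on) are purely illustrative.

    #!/usr/bin/perl -w
    # Sketch: group access-log requests by host into "visits".
    # Assumes Common Log Format lines and a 30-minute idle timeout;
    # names here are illustrative, not the chapter's actual code.

    use strict;
    use Time::Local;

    my $idle_timeout = 30 * 60;   # seconds of inactivity that end a visit
    my %last_time;                # host => epoch time of most recent request
    my %visits;                   # host => list of visits (each a list of paths)

    my %month = (Jan=>0,Feb=>1,Mar=>2,Apr=>3,May=>4,Jun=>5,
                 Jul=>6,Aug=>7,Sep=>8,Oct=>9,Nov=>10,Dec=>11);

    while (<>) {
        # host ... [dd/Mon/yyyy:hh:mm:ss zone] "METHOD /path HTTP/x.x" ...
        next unless m{^(\S+) \S+ \S+ \[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)[^\]]*\] "\S+ (\S+)};
        my ($host, $mday, $mon, $year, $hour, $min, $sec, $path)
            = ($1, $2, $3, $4, $5, $6, $7, $8);
        my $time = timegm($sec, $min, $hour, $mday, $month{$mon}, $year);

        # start a new visit if this host is new or has been idle too long
        if (!exists $last_time{$host}
                or $time - $last_time{$host} > $idle_timeout) {
            push @{ $visits{$host} }, [];
        }
        push @{ $visits{$host}[-1] }, $path;
        $last_time{$host} = $time;
    }

    # report each host's visits as a sequence of paths
    foreach my $host (sort keys %visits) {
        foreach my $visit (@{ $visits{$host} }) {
            print "$host: ", join(' -> ', @$visit), "\n";
        }
    }

Run against a log file (for example, perl visits.pl access_log), this prints one line per visit, showing the path a given host appeared to follow through the site. The idle-timeout rule is a heuristic, not a guarantee: as the caveats above suggest, shared IP addresses and proxies can still merge or split what a human would consider a single visit.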
