By Steven Levithan and Jan Goyvaerts


Broken Links Reported in Web Logs

Problem

You have a log for your website in the Combined Log Format. You want to check the log for any errors caused by broken links on your own website.

Solution

"(?:GET|POST) (?<file>[^#? "]+)(?:[#?][^"]*)? HTTP/[0-9.]+" 404 ↵
(?:[0-9]+|-) "(?<referrer>http://www\.yoursite\.com[^"]*)"
Regex options: None
Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9

"(?:GET|POST) (?P<file>[^#? "]+)(?:[#?][^"]*)? HTTP/[0-9.]+" 404 ↵
(?:[0-9]+|-) "(?P<referrer>http://www\.yoursite\.com[^"]*)"
Regex options: None
Regex flavors: PCRE 4, Perl 5.10, Python

"(?:GET|POST) ([^#? "]+)(?:[#?][^"]*)? HTTP/[0-9.]+" 404 ↵
(?:[0-9]+|-) "(http://www\.yoursite\.com[^"]*)"
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

When a visitor clicks a link on your website that points to a file on your own site that does not exist, the visitor gets a “page not found” error. Your web server will write an entry in its log that contains the file that does not exist as the requested object, status code 404, and the page that contains the broken link as the referrer. So you need to extract the requested object and the referrer from log entries that have status code 404 and a referring URL on your own website.
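As a rough illustration of the extraction described above, the following Python sketch applies the named-group (Python flavor) regex to a few log lines and reports each missing file together with the page that links to it. The sample log entries, the function name find_broken_links, and the IP addresses are made up for this example, and www.yoursite.com is the recipe's placeholder domain. The regex assumes the fields are separated by single spaces, as in the Combined Log Format.

```python
import re

# Python-flavor regex from the Solution, written on one line of pattern text.
# www.yoursite.com is the placeholder domain used in the recipe.
BROKEN_LINK = re.compile(
    r'"(?:GET|POST) (?P<file>[^#? "]+)(?:[#?][^"]*)? HTTP/[0-9.]+" 404 '
    r'(?:[0-9]+|-) "(?P<referrer>http://www\.yoursite\.com[^"]*)"'
)


def find_broken_links(lines):
    """Yield (missing file, referring page) for 404s referred from our own site."""
    for line in lines:
        match = BROKEN_LINK.search(line)
        if match:
            yield match.group("file"), match.group("referrer")


# Hypothetical sample entries in Combined Log Format, for illustration only.
SAMPLE_LOG = [
    # 404 with a referrer on our own site: reported.
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /old-page.html HTTP/1.1" '
    '404 512 "http://www.yoursite.com/index.html" "Mozilla/5.0"',
    # Successful request: ignored.
    '203.0.113.5 - - [10/Oct/2023:13:55:40 +0000] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0"',
    # 404 referred from another site: ignored.
    '198.51.100.7 - - [10/Oct/2023:13:56:02 +0000] "GET /missing.png HTTP/1.1" '
    '404 - "http://www.example.org/links.html" "Mozilla/5.0"',
]

for missing, referrer in find_broken_links(SAMPLE_LOG):
    print(f"{referrer} links to missing file {missing}")
```

Because the "file" group stops at the first #, ?, space, or quote, any query string or fragment in the request is left out of the reported file name.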

One way to do this would be to use your favorite programming language to write a script that parses every entry in the log with a regex for the Combined Log Format. While iterating over all the matches, check whether the “status” group captured 404 and whether the “referrer” ...
