Broken Links Reported in Web Logs

Problem

You have a log for your website in the Combined Log Format. You want to check the log for any errors caused by broken links on your own website.

Solution

"(?:GET|POST) (?<file>[^#? "]+)(?:[#?][^ "]*)? HTTP/[0-9.]+" 404 (?:[0-9]+|-) "(?<referrer>http://www\.yoursite\.com[^"]*)"
Regex options: None
Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9
"(?:GET|POST) (?P<file>[^#? "]+)(?:[#?][^ "]*)? HTTP/[0-9.]+" 404 (?:[0-9]+|-) "(?P<referrer>http://www\.yoursite\.com[^"]*)"
Regex options: None
Regex flavors: PCRE 4, Perl 5.10, Python
"(?:GET|POST) ([^#? "]+)(?:[#?][^ "]*)? HTTP/[0-9.]+" 404 (?:[0-9]+|-) "(http://www\.yoursite\.com[^"]*)"
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

When a visitor clicks a link on your website that points to a file on your own site that does not exist, the visitor gets a “page not found” error. Your web server will write an entry in its log that contains the file that does not exist as the requested object, status code 404, and the page that contains the broken link as the referrer. So you need to extract the requested object and the referrer from log entries that have status code 404 and a referring URL on your own website.
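As a rough illustration of that extraction, the following Python sketch applies the named-capture (Python flavor) variant of the solution regex, with the literal spaces of the log format written out, to a few log lines. The sample entries, the IP addresses, and the function name `find_broken_links` are made up for this example; `www.yoursite.com` is the placeholder host used in the recipe.

```python
import re

# Python-flavor variant of the solution regex. It matches only entries with
# status 404 whose referrer starts with http://www.yoursite.com.
BROKEN_LINK = re.compile(
    r'"(?:GET|POST) (?P<file>[^#? "]+)(?:[#?][^ "]*)? HTTP/[0-9.]+" 404 '
    r'(?:[0-9]+|-) "(?P<referrer>http://www\.yoursite\.com[^"]*)"'
)

# Hypothetical Combined Log Format entries, invented for this sketch.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /old/page.html?id=7 HTTP/1.1" 404 512 '
    '"http://www.yoursite.com/index.html" "Mozilla/5.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:40 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"http://www.yoursite.com/" "Mozilla/5.0"',
    '198.51.100.9 - - [10/Oct/2023:13:56:01 +0000] '
    '"GET /gone.html HTTP/1.1" 404 512 '
    '"http://www.example.org/links.html" "Mozilla/5.0"',
]


def find_broken_links(lines):
    """Return (missing file, referring page) pairs for internal 404 entries."""
    results = []
    for line in lines:
        match = BROKEN_LINK.search(line)
        if match:
            results.append((match.group("file"), match.group("referrer")))
    return results


for missing, page in find_broken_links(SAMPLE_LOG):
    print(f"{page} links to missing file {missing}")
```

Only the first sample entry should be reported: the second has status 200, and the third was referred from another site. The query string `?id=7` is left out of the "file" group because the character class stops at `?` or `#`.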

One way to do this would be to use your favorite programming language to write a script that implements the regex from the Combined Log Format recipe. While iterating over all the matches, check whether the “status” group captured 404 and whether the “referrer” ...
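A minimal Python sketch of this script-based alternative might look like the following. The pattern `LOG_ENTRY` is an illustrative stand-in written for this example, not the regex from the Combined Log Format recipe, and the group names other than "status" and "referrer" are assumptions. The prefix test on the referrer is also an assumption, based on the goal stated earlier of finding referring URLs on your own website.

```python
import re

# Illustrative general-purpose Combined Log Format pattern (an assumption,
# not the recipe's regex). Each field of the entry gets its own named group.
LOG_ENTRY = re.compile(
    r'^(?P<client>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+) (?P<protocol>[^"]+)" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"$'
)

SITE_PREFIX = "http://www.yoursite.com"  # placeholder host from the recipe


def internal_404s(lines):
    """Yield (requested object, referrer) for 404s referred by our own site."""
    for line in lines:
        match = LOG_ENTRY.match(line)
        if match is None:
            continue  # skip lines that are not in Combined Log Format
        # Filter in code rather than in the regex itself.
        if (match.group("status") == "404"
                and match.group("referrer").startswith(SITE_PREFIX)):
            # Unlike the solution regex, "request" keeps any query string.
            yield match.group("request"), match.group("referrer")
```

Compared with the single-purpose regex in the Solution, this version parses every entry and filters afterward, which is more work per line but makes it easy to change the status code or the referrer test.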
