Calculating Apache Hits per IP Address

Credit: Mark Nenadov

Problem

You need to examine an Apache log file to find out how many hits were recorded from each IP address that accessed the server.

Solution

Many of the chores of administering a web server have to do with analyzing Apache logs, which Python makes easy:

def CalculateApacheIpHits(logfile_pathname):
    # Make a dictionary to store IP addresses and their hit counts
    IpHitListing = {}

    # Iterate over the file object itself, which reads one line at a
    # time, so even a huge log is never held in memory all at once;
    # the with statement closes the file when the loop is done
    with open(logfile_pathname, "r") as Contents:
        # Go through each line of the logfile
        for line in Contents:
            # Split the string to isolate the IP address
            Ip = line.split(" ")[0]

            # Ensure length of the IP address is proper (see discussion)
            if 6 < len(Ip) <= 15:
                # Increase by 1 if IP exists; else set hit count = 1
                IpHitListing[Ip] = IpHitListing.get(Ip, 0) + 1

    return IpHitListing

Discussion

This recipe shows a function that returns a dictionary containing the hit counts for each individual IP address that has accessed your Apache web server, as recorded in an Apache log file. For example, a typical use would be:

HitsDictionary = CalculateApacheIpHits("/usr/local/nusphere/apache/logs/access_log")
print(HitsDictionary["127.0.0.1"])

This function is useful for many tasks. For example, I often use it to determine how many hits actually originate from locations other than my local host. It can also be used to chart which IP addresses are most actively viewing pages served by a particular installation of Apache.
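Both uses can be sketched on top of the returned dictionary. The sample data and the helper names external_hits and most_active below are hypothetical, not part of the recipe, and the sketch assumes Python 3:

```python
# Hypothetical sample data standing in for the dictionary returned by
# CalculateApacheIpHits; the addresses and counts are made up.
hits = {
    "127.0.0.1": 42,
    "192.0.2.10": 7,
    "198.51.100.23": 19,
    "203.0.113.5": 3,
}

def external_hits(hit_counts, local_addresses=("127.0.0.1",)):
    """Return the total number of hits from non-local addresses."""
    return sum(count for ip, count in hit_counts.items()
               if ip not in local_addresses)

def most_active(hit_counts, n=3):
    """Return the n (ip, count) pairs with the highest hit counts."""
    ranked = sorted(hit_counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

print(external_hits(hits))
for ip, count in most_active(hits, 2):
    print(ip, count)
```

The ranked list of pairs is a convenient input for whatever charting tool you prefer.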

This function performs a modest validation of each IP address, which is really just a length check:

An IPv4 address in dotted-decimal form is never longer than 15 characters (4 groups of three digits and 3 periods).
An IPv4 address in dotted-decimal form is never shorter than 7 characters (4 single digits and 3 periods).
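To see what this length check does and does not catch, here is a small sketch that applies the same test to a few sample strings. The helper name passes_length_check and the sample values are hypothetical:

```python
def passes_length_check(field):
    """Mirror the recipe's test: accept strings of 7 to 15 characters."""
    return 6 < len(field) <= 15

# Made-up first fields such as might appear at the start of a log line
samples = [
    "127.0.0.1",        # 9 characters
    "255.255.255.255",  # 15 characters, the longest dotted quad
    "1.2.3",            # 5 characters, too short
    "-",                # 1 character, too short
    "not-an-address",   # 14 characters, garbage of plausible length
]

for sample in samples:
    print(repr(sample), passes_length_check(sample))
```

The last sample passes even though it is not an address, which shows how approximate the check is: it rejects only strings whose length is impossible.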

The purpose of this check is not to enforce any stringent validation (for that, we could use a regular expression), but rather to reduce, at extremely low runtime cost, the probability of data that is obviously garbage getting into the dictionary. As a general technique, performing low-cost, highly approximate sanity checks for data that is expected to be okay (but one never knows for sure) is worth considering.
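For comparison, the stricter regular-expression validation mentioned above might look like the following sketch. The names DOTTED_QUAD and looks_like_ipv4 are hypothetical, not part of the recipe; the sketch assumes Python 3.4 or later (for fullmatch) and handles IPv4 dotted-decimal addresses only:

```python
import re

# Four groups of one to three ASCII digits, separated by periods
DOTTED_QUAD = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")

def looks_like_ipv4(field):
    """Return True if field is a dotted quad with every group in 0..255."""
    match = DOTTED_QUAD.fullmatch(field)
    if match is None:
        return False
    return all(int(group) <= 255 for group in match.groups())

for candidate in ["127.0.0.1", "999.1.1.1", "not-an-address", "1.2.3"]:
    print(repr(candidate), looks_like_ipv4(candidate))
```

This costs a regular-expression match per log line, which is the trade-off the simple length check avoids.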

Source: Python Cookbook
