Live Monitoring Web-Server Log Files with Grep / Tail

March 24, 2008 10:52pm
Categories: Technical Tips

It's amazing how much information can be taken directly from a Web server's log file. Here are some tips for tracking traffic and extracting some cool information.

by Darrin J. Ward

Analytics packages (WebTrends, Coremetrics, Google Analytics, etc.) have come a long way in the past decade. They offer custom, multidimensional data extraction and comparison functionality, which is extremely useful when it comes to understanding macro website performance. Unfortunately, they do tend to miss some of the micro-level detail. Fortunately, there are still ways to track live hits to the server, down to the last bit of detail. This article assumes a reasonable understanding of Unix, SSH and Web server log files.

Web servers generally log the requests they serve into a regular text file. The default log file for the Apache Web server is typically named "access_log" (some distributions name it "access.log" instead). Once you know where the log files are located, it is possible to use a combination of Linux commands (which probably have Windows equivalents) to perform a variety of operations for data extraction.

The two commands that I use the most are "grep" and "tail". Grep, in essence, searches for patterns in files. Tail prints the end of a file and is also capable of streaming it, i.e. keeping the file open and printing new lines as other programs write to it.

SSH'ing Into Your Server
On Windows, you can use the PuTTY SSH program to SSH into your server. On Mac, you can use the OS X Terminal (ssh -l username domain.com). Once you're logged in, type "man grep" and "man tail" to read the grep and tail manual pages, respectively. Familiarize yourself with how they work. If you can't get to see the manuals, check with your administrator to ensure that grep/tail are installed.

Once you're logged into the server and have familiarized yourself with grep/tail, navigate to the directory that hosts your Web server's log files, e.g. "cd /home/darrinward/logfiles/".

All of the following assume that you are logged into the server, have navigated to the log files directory, and that grep and tail are installed and working. Further, I will use access_log as the default file name below - replace that with the actual name of your Web server's log file.

Looking at The Latest Web Server Hits

Type the following command to see the last 10 hits to your Web server:

tail access_log

Type the following to see the latest X number of hits to the server (replace X with an actual number):

tail -n X access_log
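To see the line-count option in action without touching a real log, here is a minimal sketch. It assumes nothing about your server: it builds a made-up 20-line file in a temporary directory (the names demo_dir, demo_log and last_three are purely illustrative) and prints the last 3 lines. The "-n 3" spelling is the standard form of the older "-3" shorthand.

```shell
# Build a throwaway 20-line file standing in for access_log (contents are made up).
demo_dir=$(mktemp -d)
demo_log="$demo_dir/access_log"
i=1
while [ "$i" -le 20 ]; do
  echo "hit number $i" >> "$demo_log"
  i=$((i + 1))
done

# Keep only the last 3 lines of the file.
last_three=$(tail -n 3 "$demo_log")
echo "$last_three"
```

On a real server you would simply point the same tail command at your own log file.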

One thing that I absolutely love to do is track hits to the server in real time. I look at the IPs and get a sense of what people are clicking on, how long pages take to load, how people interact with the page using back/forward, etc. To track live hits to the server (real-time), use this command:

tail -f access_log

How you use this is completely up to you - but I guarantee you that you'll become addicted.
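If you want to watch the follow mode work before trying it on a busy server, the sketch below simulates it. Everything here is an assumption made for the demo: the file lives in a temporary directory, the "hits" are invented lines, and tail runs in the background for a few seconds so the script can stop it on its own (interactively you would just press Ctrl+C).

```shell
# Throwaway stand-in for access_log; the lines are invented for the demo.
demo_dir=$(mktemp -d)
demo_log="$demo_dir/access_log"
live_out="$demo_dir/live_output.txt"
echo "first hit" > "$demo_log"

# Follow the file in the background, capturing whatever tail prints.
tail -f "$demo_log" > "$live_out" &
tail_pid=$!

# Simulate the Web server appending two more hits.
sleep 1
echo "second hit" >> "$demo_log"
echo "third hit" >> "$demo_log"
sleep 2

# Stop following and show what was captured.
kill "$tail_pid"
wait "$tail_pid" 2>/dev/null || true
cat "$live_out"
```

The captured output contains the existing line plus the two lines appended while tail was running, which is exactly what you see scrolling by on a live log.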

Tracking Robot Hits, etc. With Grep

Grep is very useful for finding specific hits within your log files. For example, we can extract all hits from Google's search robot "Googlebot" with the following command:

grep 'Googlebot' access_log

Likewise, we can see all hits coming from Yahoo's Slurp with:

grep 'Slurp' access_log

Note that you can substitute anything you like as the search pattern (the part between the single quotes). The above examples rely on the "User Agent" for robot spider identification. If the user agent is not being tracked, then it will be much more difficult to track robots.
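Often you only want to know how many robot hits there were, not read every line. A minimal sketch, assuming the log records the user agent: grep's -c option prints the number of matching lines, and -i ignores case. The four log lines below are invented for illustration (written in the style of Apache's combined format, with addresses from reserved documentation ranges), and the variable names are hypothetical.

```shell
# Made-up sample log; not real traffic.
demo_dir=$(mktemp -d)
demo_log="$demo_dir/access_log"
cat > "$demo_log" <<'EOF'
192.0.2.10 - - [24/Mar/2008:22:40:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.7 - - [24/Mar/2008:22:40:05 +0000] "GET /about.html HTTP/1.1" 200 2048 "http://www.example.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/2.0"
192.0.2.10 - - [24/Mar/2008:22:41:10 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.20 - - [24/Mar/2008:22:42:30 +0000] "GET /contact.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
EOF

# -c counts matching lines instead of printing them.
googlebot_hits=$(grep -c 'Googlebot' "$demo_log")

# -i makes the match case-insensitive, so "slurp" also finds "Slurp".
slurp_hits=$(grep -ci 'slurp' "$demo_log")

echo "Googlebot: $googlebot_hits, Slurp: $slurp_hits"
```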

Combining Grep & Tail For Advanced Tracking

It is possible to "pipe" the output from one command to another (using the | character). This essentially means that the first command processes the information, then passes its output to the second command for further processing. For example, the following command will "tail" stream your log file in real time, then "grep" hits that are from Googlebot:

tail -f access_log | grep 'Googlebot'

Again, you can modify the search pattern for grep.
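The same pipe works without the -f option when you only want a snapshot, for example "which of the most recent hits came from Googlebot?". The sketch below assumes a made-up five-line log in a temporary directory (all names and lines are illustrative only) and filters just its last three lines.

```shell
# Made-up sample log; not real traffic.
demo_dir=$(mktemp -d)
demo_log="$demo_dir/access_log"
cat > "$demo_log" <<'EOF'
192.0.2.10 - - [24/Mar/2008:22:40:01 +0000] "GET /a.html HTTP/1.1" 200 512 "-" "Googlebot/2.1"
203.0.113.7 - - [24/Mar/2008:22:40:05 +0000] "GET /b.html HTTP/1.1" 200 512 "-" "Firefox/2.0"
203.0.113.8 - - [24/Mar/2008:22:40:09 +0000] "GET /c.html HTTP/1.1" 200 512 "-" "Firefox/2.0"
192.0.2.10 - - [24/Mar/2008:22:41:10 +0000] "GET /d.html HTTP/1.1" 200 512 "-" "Googlebot/2.1"
192.0.2.20 - - [24/Mar/2008:22:42:30 +0000] "GET /e.html HTTP/1.1" 200 512 "-" "Yahoo! Slurp"
EOF

# tail keeps the last 3 lines; grep then keeps only the Googlebot ones.
recent_bot_hits=$(tail -n 3 "$demo_log" | grep 'Googlebot')
echo "$recent_bot_hits"
```

Only the request for /d.html survives both stages; the earlier Googlebot hit falls outside the last three lines. One related note: if you extend the live "tail -f" version with a third command after grep, matches may appear late because grep buffers its output when writing to a pipe; GNU grep's --line-buffered option is meant for that situation.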

Finally... here is the command that I like the most. Its primary purpose is to track visitors referred from search results pages in Google, Yahoo & Live/MSN.

tail -f access_log | grep -E 'www\.google\.|search\.yahoo|search\.live|search\.msn'
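One caveat worth knowing: grep matches anywhere on the line, so a pattern like "www.google." can also catch Googlebot itself, whose user agent string contains a www.google.com address. The sketch below is one possible refinement, not the only way to do it. It assumes your log uses Apache's combined format, where splitting a line on double quotes puts the referrer in the fourth field, and it uses invented log lines and hypothetical variable names.

```shell
# Made-up sample log in combined format; not real traffic.
demo_dir=$(mktemp -d)
demo_log="$demo_dir/access_log"
cat > "$demo_log" <<'EOF'
198.51.100.1 - - [24/Mar/2008:22:40:01 +0000] "GET / HTTP/1.1" 200 512 "http://www.google.com/search?q=log+files" "Mozilla/5.0 Firefox/2.0"
198.51.100.2 - - [24/Mar/2008:22:40:09 +0000] "GET / HTTP/1.1" 200 512 "http://search.yahoo.com/search?p=tail+grep" "Mozilla/5.0 Firefox/2.0"
198.51.100.3 - - [24/Mar/2008:22:41:15 +0000] "GET /tips.html HTTP/1.1" 200 512 "http://www.example.com/links.html" "Mozilla/5.0 Firefox/2.0"
198.51.100.4 - - [24/Mar/2008:22:42:20 +0000] "GET / HTTP/1.1" 200 512 "http://search.live.com/results.aspx?q=apache" "Mozilla/5.0 Firefox/2.0"
192.0.2.10 - - [24/Mar/2008:22:43:30 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# Whole-line match: also counts the Googlebot hit, because its
# user agent contains "www.google.com".
whole_line_hits=$(grep -E -c 'www\.google\.|search\.yahoo|search\.live|search\.msn' "$demo_log")

# Referrer-only match: split on double quotes and test field 4 (the referrer).
referrer_hits=$(awk -F'"' '$4 ~ /www\.google\.|search\.yahoo|search\.live|search\.msn/ { n++ } END { print n + 0 }' "$demo_log")

echo "whole line: $whole_line_hits, referrer only: $referrer_hits"
```

The awk filter can take grep's place after "tail -f access_log |" if you want the stricter, referrer-only version on a live log.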

I hope these commands prove useful. Please drop me a line if you need help.

 