Chapter 42

Processing Logs and Analyzing Site Use


Chapter 41, "How to Keep Them Coming Back for More," shows the depth of information about a site that you can get from the access logs, also known as transfer logs. Many tools on the Net can help you make sense of these and other logs. This chapter explores the logs and tools, and recommends ways to improve site efficiency based on log analysis.

The most common types of logs include the access log, the referer log, and the user_agent log. Analysis tools are available for each type of log, as well as for the access logs kept by non-HTTP servers. The error log is generally not analyzed automatically; rather, the Webmaster should review it by hand for patterns and trends.

NCSA Logs

Chapter 41 introduces the common log format for access logs, also known as transfer logs. Many servers can be configured to keep other logs as well.

Note
Check your log directory and httpd.conf for agent_log and referer_log. If you've configured these logs and they aren't appearing, go back and check the makefile in the install kit; some servers ship with those logs disabled by default. For Apache, follow the instructions in the install file, which tell you which lines in the Configuration file to uncomment and how to rebuild the server. It isn't enough just to add them to the makefile; Configure has to write a bit of C code to tell the server that these log types exist.
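For Apache 1.x, the pieces involved look something like the following sketch; the Module lines are uncommented in the Configuration file before re-running Configure, and the log directives go in httpd.conf. Module names and defaults vary between releases, so treat this as an illustration rather than an exact recipe.

# In the Configuration file (uncomment, then re-run Configure and rebuild):
Module referer_log_module    mod_log_referer.o
Module agent_log_module      mod_log_agent.o

# In httpd.conf (paths are relative to ServerRoot):
RefererLog logs/referer_log
AgentLog logs/agent_log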

Error Logs

Reading the error log is so valuable that a good Webmaster will check it by hand regularly. The following sections cover some of the errors you may see.

file does not exist

If this message comes up regularly for the same URL, there's probably a bad link pointing to it somewhere, or the URL has been given out incorrectly.

malformed header from script

If you see this message, the log should also name a CGI script that isn't responding with valid HTTP. Most likely the script is failing to compile, and Perl is throwing errors; the error Perl reported usually shows up nearby in the error log. If the server is busy, use grep (or the one-line filter shown below) to pull out requests from your own host so that you can see which lines are yours. See Chapter 8, "Six Common CGI Mistakes and How to Avoid Them," for details on how to troubleshoot failing Perl CGI scripts.
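A minimal sketch of that filter, assuming you are testing from the server itself (see the Tip that follows) so that your requests are logged as coming from localhost, and assuming the error log sits in logs/ under ServerRoot; plain grep works just as well:

$perl -ne 'print if /localhost/' logs/error_log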

Tip
When you're logged in locally, you can still point a browser at the server (with an http:// URL) rather than opening pages directly with file:. If the request goes through the server, it gets logged as coming from localhost.

file permissions deny server execution

Just as this error says, someone has tried to run a file whose permission bits don't allow the server to execute it. Check the file name; maybe it isn't meant to be executed at all. If it is, set the execute bits with the chmod command.

script not found or unable to stat

The server couldn't find the named script or, if it found the script, couldn't get any information about it. The most likely explanation is that the script doesn't exist at the specified URL. If it does exist, check the permission bits; remember that the server usually runs as user nobody, so the file must be readable and executable by that user.

invalid CGI ref

The server knows the browser is trying to access a CGI script through an include, but the associated CGI script can't be run. (Possibly, it can't be found.)

task timed out for host

When the server is about to do something time-consuming, such as talk to the outside world, it sets a timer. If the process at the other end hangs or dies for some reason, the timer eventually goes off and the server continues with its work after logging this error.

attempt to invoke directory as script

This message is fairly self-explanatory: a request tried to run a directory as though it were a CGI script. The server logs the error and moves on.

unable to include file1 in parsed file file2

Here, file2 uses a server-side include (SSI) to pull in file1, and the server can't do it. Possibly there's a permissions problem, or file1 may not exist.
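For reference, the include directive inside file2 looks something like this; the path is a hypothetical example, and the included file must exist and be readable by the user the server runs as:

<!--#include virtual="/fragments/file1.html" -->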

Other Errors

On many servers, anything a CGI script writes to stderr is captured to the error log. Use this fact to help keep CGI scripts error-free.
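For example, a Perl CGI script can write its own diagnostics to STDERR, and they will show up in the error log next to the server's messages. A minimal sketch; the script name and messages are invented for illustration:

#!/usr/local/bin/perl
# feedback.cgi -- hypothetical script showing diagnostics sent to stderr

print "Content-type: text/html\n\n";     # valid header first, to avoid "malformed header"

print STDERR "feedback.cgi: started at ", scalar localtime, "\n";

# ... normal form processing would go here ...

print "<HTML><BODY>Thanks for your feedback.</BODY></HTML>\n";
print STDERR "feedback.cgi: finished normally\n";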

Referer Logs

Chapter 4, "Designing Faster Sites," mentions various headers the browser can send to the server. One of those is Referer. Try this experiment: connect to your Web server with Telnet and ask for a document, supplying a Referer header:

$telnet www.xyz.com 80
Trying...
Connected to www.xyz.com.
Escape character is "^]".
GET / HTTP/1.0
Referer: some other page

Finish the request with a blank line. Now go to the logs directory under ServerRoot and find the referer log (referer_log, or whatever name you configured). It should contain a line like this:

some other page -> /

The referer log gives a first cut at the sort of link-count analysis described in Chapter 41, "How to Keep Them Coming Back for More."
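A minimal sketch of such a first cut, assuming the "referring page -> local URL" format shown above; it simply counts how often each outside page sent a visitor your way, and the log path is an assumption:

#!/usr/local/bin/perl
# tally_referers -- rough link-count report from an NCSA-style referer log

my %count;
open(LOG, "logs/referer_log") || die "can't open referer log: $!\n";
while (<LOG>) {
    chomp;
    my ($from, $to) = split / -> /;     # "referring page -> local URL"
    $count{$from}++ if defined $from;
}
close(LOG);

# Busiest referring pages first
foreach my $from (sort { $count{$b} <=> $count{$a} } keys %count) {
    printf "%6d  %s\n", $count{$from}, $from;
}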

User-Agent Logs

Another header that can accompany the request is User-Agent. Repeat the experiment from the preceding section, but this time, ask for a document and give a User-Agent:

$telnet www.xyz.com 80
Trying...
Connected to www.xyz.com.
Escape character is "^]".
GET / HTTP/1.0
User-Agent: foo 1.0

Again, finish the request with a blank line. The result will appear in agent_log in the logs directory.

Both the referer and agent logs give useful information. This information would be even more useful if it appeared in the transfer log, where it could be associated with each access. Some servers support this behavior, either as a standard feature or experimentally.
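If you run a recent Apache with mod_log_config, for example, the transfer log's format can be extended to carry both headers. A minimal sketch for httpd.conf, assuming your build includes that module (directive details vary by version, so check your server's documentation):

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
TransferLog logs/access_log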

Off-the-Net Tools

A variety of tools are available on the Net for reading the logs. The vast majority focus on counting hits or a similar performance metric. As mentioned in Chapter 41, "How to Keep Them Coming Back for More," hits aren't as useful as other information, such as dwell time and link counts, that can be extracted from the log. The tools are classified in the following sections by which log they examine.

Access Analyzers

These analyzers work on the access log. Access logs, usually the largest of the logs, can grow 100M a day on a busy site. When comparing access log analyzers, look at their speed, whether they support incremental operation, the range of reports they produce, and the output formats they offer.

wwwstat

wwwstat, available at http://ics.ucl.edu/WebSoft/wwwstat/, when used with the metasummary script from http://www.ai.mit.edu/tools/usum/usum.html, compresses a 5M daily log down to a digest that you can read in your morning e-mail. Its reports focus mainly on system load, although it does some useful filtering (for example, GIFs versus non-GIFs) to give you a better idea of what's being accessed, not just how often.

wwwstat is one of the slower analyzers (running at about two percent of the speed of getstats 1.3 beta), but it affords incremental operation, so you could run it daily to keep up with the load.
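If you settle on a nightly run, cron is the natural scheduler. A minimal sketch of a crontab entry, assuming a hypothetical wrapper script, /usr/local/bin/nightly-stats, that runs your chosen analyzer over the day's log and files or mails the digest:

# minute hour day-of-month month day-of-week   command
30 2 * * * /usr/local/bin/nightly-stats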

wusage

wusage is a shareware program available at http://www.boutell.com/wusage/. In addition to counts, it produces a Popular Documents Report and a Frequent Sites Report. The program is highly configurable, allowing you to select which reports are produced and to what degree of detail.

wusage runs at about five percent of the speed of getstats and is incremental. The output is available in GIF files, such as the one shown in Figure 42.1, to make it more comprehensible.

Figure 42.1: GIF files make the output from wusage more understandable.

getstats.c

This program, available at http://www.eit.com/software/getstats/getstats.html, produces detailed analyses by time (at many levels of granularity), domain, request, and directory tree. There's a front end for getstats called CreateStats.

Getstats is one of the fastest log analysis programs available, clocking an amazing 21,000 lines per second in a test done at Uppsala University. The results are posted at http://www.uu.se/Software/Getstats/Performance.html. Furthermore, getstats is capable of incremental operation, so the entire log doesn't have to be analyzed at once.

Note
getstats isn't in the public domain, but the developers permit it to be distributed freely as long as it's unchanged from its original distribution.

Analog

Analog is another fast log analyzer; it was the only one in the Uppsala study to rival getstats. Its results are primarily counts, although it has nice reports on top referers and requests. Summaries are available for each report, with limits set by the user. Analog is free at http://www.stats.lab.cam.ac.uk/~sret1/analog/.

Analog is similar to getstats. If you like getstats, look at Analog and choose the format you prefer.

getstats_plot

As its name suggests, this program is particularly strong in producing various access plots, such as the one shown in Figure 42.2. This plotter is available at http://infopad.eecs.berkeley.edu/stats/.

Figure 42.2: getstats_plot shows a graphical view of the data by month.

WebReport

WebReport comes with the NCSA server (available at http://hoohoo.ncsa.uiuc.edu/) but can be set up for any system that supports the Common Log Format (described in Chapter 41, "How to Keep Them Coming Back for More").

WAMP

WAMP, available at http://www.wwu.edu/~n9146070/wamp.html, is good at one thing: reporting which domains accessed the site. Figure 42.3 shows a sample of the output from this Perl script.

Figure 42.3: Part of WAMP's output shows the number of accesses by domain.

Statbot

Statbot is one of the more comprehensive analyzers. It works by building a database from the log and then running queries against the database. Various reports are available, and Statbot adds new log entries to its database as it discovers them.

Statbot is available as shareware at http://www.xmission.com/~dtubbs/club/cs.html. Source code is available to registered users.

WebStat

WebStat is a newer program that performs a detailed analysis of the traffic patterns reflected in the access log, including reports on how users move through the site. A commercial product from Huntana, it is described at http://www.tgc.com/websec/20274.html.

Combined Log File Handling System

This program is best known for its capability to handle large input sets from multiple servers (for example, FTP, Gopher, and HTTP). More information is available at http://www.hensa.ac.uk/tooks/www/logtools/.

fwgstat

fwgstat is an older program that reads the most common formats from several servers: FTP, Gopher, and the HTTP Common Log Format. It's available at http://sunsite.unc.edu/jem/fwgstat.html.

pwebstats

This Perl script works with proxy/cache servers as well as conventional systems. It's rather slow, running at about two percent of the speed of getstats, and doesn't support incremental analysis. Details and the source are available at http://www.unimelb.edu.au/pwebstats.html. Figure 42.4 shows the fine detail available from this script.

Figure 42.4: Pwebstats offers a fine degree of detail that can be useful for many sites.

Emily

Emily's graphical output tracks international, national, and local accesses separately. This division can be useful for a campus or large company, so that intranet accesses are plotted separately from non-local accesses.

Emily is available at http://www.curtin.edu.au/~glenn/products/emily/.

Web-Scope Statistics

Web-Scope is one of the few analyzers that reports visitors' paths through the site (although it doesn't compute dwell time). In addition to its detailed reports, Web-Scope can generate a summary report for the past 16 days. It also reports an interesting statistic: "pages per visitor."

Figure 42.5 shows the report for Web-Scope. Web-Scope is a commercial service available in real time. It's described at http://www.tlc-systems.com/dir.html.

Figure 42.5: Manual examination of Web-Scope's output shows paths, depth, and dwell time for each visitor.

InterGreat_WebTrends

Like most programs described here, this program handles input in the Common Log Format. It's described at http://www.egsoftware.com/webtrend.html. Figure 42.6 shows sample output.

Figure 42.6: WebTrends gives a quick look at the distribution of incoming requests by site.

AccessWatch

This Perl script looks at the data a bit differently than many of the programs and scripts described earlier. It reports many of the same statistics as other analyzers but makes heavy use of graphical output. The Uppsala team analyzed AccessWatch, although not as part of its major study, and found its throughput to be well under one percent of that of getstats: around 48 lines per second, compared to 21,000 lines per second for getstats. See the sample output in Figures 42.7 and 42.8. AccessWatch is available at http://www.eg.bucknell.edu/~d/.

Figure 42.7: AccessWatch provides an amazingly concise summary and projection of site activity.

Figure 42.8: The "Accesses by Domain" report gives a rough estimate of penetration into various markets.

W3Perl

Unlike the other programs in this section, W3Perl looks at referer and agent logs as well as the transfer log. It's the slowest analyzer tested in the Uppsala study, coming in at less than one percent of the throughput of getstats, although it does handle incremental input. Figure 42.9 shows its output. More information about W3Perl is available at http://www.club-internet.fr/~domisse/w3perl/Docs/html/index.html.

Figure 42.9: Showing referring pages gives a quick indication of where visitors are finding your site.

MK-Stats

MK-Stats is an exceptional program. The reports are highly user-oriented and can be customized. It can handle multiple input files and can produce output graphics in ray-traced format as well as the more traditional graphical and textual formats. It's available at http://web.sau.edu/~mkruse/mkstats/. Information about the ray-traced output is available at http://web.sau.edu/~mkruse/www/scripts/access3.html.

3Dstats

Like MK-Stats, 3Dstats moves Web statistics to the third dimension. It doesn't include a ray tracer in its own package; instead, it outputs VRML, which can be viewed in any VRML browser. For more information, visit http://ww.netstore.de/Supply/3Dstats/.

Multi-WebServer Statistics Tool

This program, formerly known as mw3s, produces statistics from several servers. It has graphical output, although the author warns "this feature only works with Netscape."

This tool consists of two programs: logscan is run as a CGI script, and loggather runs from the crontab. When loggather runs, it invokes logscan on each server where statistics are being gathered and generates new WebCharts. This tool's distinctive feature is its capability to produce a Top 20 list across more than one server. For more information, visit http://engleberg.dmu.ac.uk/webtools/mw3s/mw3s.html.

Agent Analyzers and Referer Analyzers

Agent_log analyzers help you answer the question, "Which browsers are my visitors really using?" Referer_log analyzers address the question, "Where did visitors come from?" Part of that answer tells you the path visitors take through the site. Another part tells which external sites have links that are building traffic. (This information is particularly important if the site has paid for a click-through ad on an external site.)

At one time, several dedicated programs read these logs. The trend has been to move this functionality into the programs that are already being used to read the access logs. For example, W3Perl, described earlier, will examine the agent and referer logs, as shown in Figure 42.9.

One of the challenges an agent log analyzer must deal with is so-called cloaked browsers. For example, Microsoft Internet Explorer identifies itself both by its own name and as Mozilla (that is, Netscape). The better browser analyzers can distinguish the browser's true name from the cloaked name. Figure 42.10 shows the output of BrowserCounter, from http://www.netimages.com/~snowhare/utilities/browsercounter.html; a rough sketch of such a test follows the figure.

Figure 42.10: Knowing how many visitors use enhanced browsers like Netscape allows the Webmaster to decide how to design the site.
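Here is that rough sketch, assuming an NCSA-style agent_log with one User-Agent string per line; the classification is deliberately crude and the category labels are arbitrary:

#!/usr/local/bin/perl
# tally_agents -- crude classification of cloaked User-Agent strings

my %count;
while (<>) {
    chomp;
    if (/MSIE/) {                      # Internet Explorer cloaks itself as Mozilla
        $count{'Microsoft Internet Explorer'}++;
    } elsif (/^Mozilla/) {             # genuine Netscape, or another Mozilla-compatible
        $count{'Netscape (Mozilla)'}++;
    } else {
        $count{'Other'}++;
    }
}

foreach my $browser (sort { $count{$b} <=> $count{$a} } keys %count) {
    printf "%6d  %s\n", $count{$browser}, $browser;
}

Run it as perl tally_agents logs/agent_log.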

RefStats is a dedicated referer analyzer available at http://www.netimages.com/~snowhare/utilities/refstats.html (see Fig. 42.11).

Figure 42.11: Knowing where visitors found the link to each page gives the Webmaster information about how visitors move through the site.

Evaluating Effectiveness

The best measure of a site is how well it accomplishes its goal. Most sites have one or more forms where users respond, whether to place an order or to ask for more information. One good way of measuring effectiveness is, first, to track the people who actually complete the site's process and fill out the form. Look for patterns in their usage: do they seem more interested in personal credentials or in product specifications? Then make sure that the site provides plenty of material for this sort of user.

Next, look at the visitors who stop at various points in the process. How many made it to the order form but didn't place an order? How many spent quite a long time (as measured by dwell time) in the product catalog but never called up an order form? Examine these patterns and try to understand them. Consider using Red Team members (a concept introduced in Chapter 2, "Reducing Site Maintenance Costs Through Testing and Validation") or other "friendly evaluators" to understand what's working on the site.
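A rough sketch of those counts, assuming the Common Log Format described in Chapter 41 and three hypothetical URLs for the catalog, the order form, and the order-processing script (substitute your own); treating each remote host as one visitor is only an approximation:

#!/usr/local/bin/perl
# funnel -- crude count of how far visitors get in the ordering process
# The three URLs below are placeholders; substitute your own pages.

my ($catalog, $form, $order) = ('/catalog/', '/order-form.html', '/cgi-bin/place-order');
my (%saw_catalog, %saw_form, %saw_order);

while (<>) {
    # Common Log Format: host ident user [date] "METHOD /url HTTP/x.x" status bytes
    my ($host) = split;
    next unless m{"(?:GET|POST) (\S+)};
    my $url = $1;
    $saw_catalog{$host}++ if index($url, $catalog) == 0;
    $saw_form{$host}++    if $url eq $form;
    $saw_order{$host}++   if $url eq $order;
}

printf "Hosts that browsed the catalog:    %d\n", scalar keys %saw_catalog;
printf "Hosts that reached the order form: %d\n", scalar keys %saw_form;
printf "Hosts that placed an order:        %d\n", scalar keys %saw_order;

Run it over a day's log (perl funnel logs/access_log) and watch how the three numbers move as the site changes.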

If you can set the transfer log to capture agent and referer information, look for correlations. Are the Mosaic users all stopping at about the same point? Perhaps that sequence of pages is unappealing in Mosaic. Look for users who are running with graphics disabled. Do they generally look at most of the site? If they leave, is there a pattern to where they leave? Do any of them download any graphics, or turn on graphics viewing at some point? What does this information tell you about those pages?

Although statistics are available that cover almost every conceivable aspect of the Web site, the real test is to determine if the site is meeting its goals and objectives.

This chapter presented a set of tools for analyzing the various logs and recommended what to measure and what patterns to look for in determining whether a site is meeting its goals and objectives. Chapter 43, "How to Apply These Lessons to the Intranet," reviews the lessons of this book in the context of private networks and servers: the Intranet.