Log Files Limitations for Traffic Analysis

Before we go further, we should discuss the limitations of log files, because this is the one area where commercial packages try to add value.

As we said above, the logs show direct accesses to the server for a page. However, there are many chances for a request to be satisfied before it ever reaches our server. In that case we will have no log entry even though the user is viewing our content. The user may already have viewed the page, so the web browser finds a copy on the local hard disk in an area known as a cache. Caching, that is, storing a copy of some information close to the user, is a popular technique for speeding up access to a slow resource such as a web page. The page may also be cached on a proxy server run by the user's organization or Internet Service Provider; these proxies are used to reduce outgoing traffic for frequently accessed resources. The web hosting company may also operate a special front-end cache, called an accelerator, to reduce load on its servers. Pages are typically kept in a cache for anything from 24 hours to a week before being refreshed. Sometimes caches are configured to check with the origin server (our website) to see whether the resource has been updated without actually downloading it; this check does reach our server and so will be counted as a request.
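The revalidation checks described above can be picked out of a log, because the server answers them with HTTP status 304 (Not Modified) rather than 200. The following is a minimal sketch, assuming the log uses the common Apache-style Combined Log Format; the sample lines, addresses and the browser name are invented for illustration.

```python
import re

# Assumed layout (Combined Log Format):
# host ident user [time] "request" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def count_revalidations(lines):
    """Count full downloads (200) and cache revalidations (304) separately."""
    full = revalidated = 0
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines we cannot parse
        if entry["status"] == "200":
            full += 1
        elif entry["status"] == "304":
            revalidated += 1
    return full, revalidated


# Invented sample: one full download, then a later revalidation by the same client.
SAMPLE = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "ExampleBrowser/1.0"',
    '192.0.2.10 - - [10/Oct/2023:14:20:02 +0000] "GET /index.html HTTP/1.1" '
    '304 - "-" "ExampleBrowser/1.0"',
]

if __name__ == "__main__":
    print(count_revalidations(SAMPLE))
```

Views served entirely from a cache, with no check against our server, leave no line at all, so neither count includes them.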

Accesses via a cache (sometimes called a proxy) mean that log files will underestimate traffic to our website. They also mean that many accesses will appear to come from a single computer. Remember that the IP address in the log file represents a computer and not a user. Some large ISPs also spread their caches across many computers, each with its own address, so a single user may have requests split over a number of IP addresses.

The referrer information tells us which page the user was visiting immediately before requesting our page. This may be a page carrying an inbound link or a search engine results page. In the latter case it will usually include the keywords used to find our page. The referrer information is not guaranteed to be correct, and some web surfers run special software to disguise this field. The same goes for the user agent information. Disguising the user agent is often necessary because some websites write pages for Microsoft's Internet Explorer and bar other browsers based on the user agent information, even though those browsers could view the pages. This is not really in the spirit of the web, and so surfers configure their browsers to send the same user agent information as Internet Explorer. User agent information will also tell you which search engine robots are visiting your website, and it is an integral part of creating cloaked webpages.
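As an illustration of pulling search keywords out of the referrer field, here is a minimal sketch. The parameter names in QUERY_PARAMS and the search.example address are assumptions made for the example; each search engine chooses its own parameter name, and the field may be absent or falsified, as noted above.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed names of the query-string parameter holding the search terms.
# Real search engines differ, so this list is illustrative only.
QUERY_PARAMS = ("q", "query", "p")


def search_keywords(referrer):
    """Return the search phrase found in a referrer URL, or None if there is none."""
    if not referrer or referrer == "-":
        return None  # "-" is how an empty referrer appears in the log
    query = parse_qs(urlsplit(referrer).query)
    for name in QUERY_PARAMS:
        if name in query:
            return query[name][0]
    return None


if __name__ == "__main__":
    # Hypothetical referrer from a search results page.
    print(search_keywords("http://search.example/search?q=log+file+analysis&hl=en"))
    # Hypothetical referrer from an ordinary inbound link.
    print(search_keywords("http://www.example.org/links.html"))
```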

For simplicity, the web communications protocol (HTTP) is stateless. That means that each request stands alone. This is a good thing on something as potentially unreliable as the Internet, where computer crashes and communication outages occur. It does mean that it is not possible to tell whether individual requests are from the same user, even if the IP address information is the same. Websites that need to track users around the site use features such as cookies to tie individual requests together. A cookie is a unique piece of information that is sent to the web browser when it first connects to a website. The browser then sends this information back each time it requests a new page on the same website.
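The cookie exchange can be sketched with Python's standard http.cookies module. This is only an illustration of the idea: the cookie name visitor_id and the function visitor_id_from_request are hypothetical, and a real site would normally also set an expiry time and security attributes.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical cookie name chosen for this sketch


def visitor_id_from_request(cookie_header):
    """Identify the visitor behind one request.

    cookie_header is the value of the request's Cookie header (or None).
    Returns (visitor_id, set_cookie_value). set_cookie_value is None when the
    browser already sent our cookie; otherwise it is the value to place in a
    Set-Cookie response header.
    """
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    if COOKIE_NAME in cookie:
        # Returning visitor: the browser sent back the identifier we issued.
        return cookie[COOKIE_NAME].value, None

    # First request: issue a fresh random identifier.
    new_id = uuid.uuid4().hex
    response = SimpleCookie()
    response[COOKIE_NAME] = new_id
    response[COOKIE_NAME]["path"] = "/"  # send it back for every page on the site
    return new_id, response[COOKIE_NAME].OutputString()


if __name__ == "__main__":
    first_id, set_cookie = visitor_id_from_request(None)
    print("Set-Cookie:", set_cookie)
    # The browser's next request carries the cookie back to us.
    second_id, _ = visitor_id_from_request(COOKIE_NAME + "=" + first_id)
    print(first_id == second_id)
```

Writing the identifier into the log alongside each request is what lets individual requests be tied together, provided the user's browser accepts cookies.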

All of this means that there is no 100% accurate way of tracking a single user's access to a site via log files. The only thing that can be said with any certainty is that the website received at least the number of hits recorded in the log file. Commercial packages such as Mach 5 and WebTrends report the paths that users take through the site. They reconstruct these paths from a combination of IP address, time, referrer field and user agent. The assumption is that requests in the same time frame, say less than 30 minutes apart, from the same computer using the same web browser are probably from the same user. However, these packages will not see pages accessed through the web browser's back and forward buttons, as these are served from the local copy of the page. They will also miss pages fetched from the browser cache or any other cache. There is also no way of knowing where users go after they leave the site.
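The grouping heuristic described above might look something like the following sketch. It is not how any particular commercial package works; it simply groups requests by IP address and user agent, and starts a new visit after a gap of more than 30 minutes. The function name, record layout and sample data are assumptions for the example.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # the "same time frame" threshold


def group_into_visits(requests):
    """Group requests into probable visits.

    requests is an iterable of (ip, agent, timestamp, path) tuples.
    Returns a list of visits; each visit is a dict with the ip, the agent and
    a "pages" list of (timestamp, path) in time order.
    """
    visits = []
    latest = {}  # (ip, agent) -> most recent visit for that combination
    for ip, agent, ts, path in sorted(requests, key=lambda r: r[2]):
        key = (ip, agent)
        visit = latest.get(key)
        # Start a new visit if we have none, or the gap since the last page is too long.
        if visit is None or ts - visit["pages"][-1][0] > SESSION_TIMEOUT:
            visit = {"ip": ip, "agent": agent, "pages": []}
            visits.append(visit)
            latest[key] = visit
        visit["pages"].append((ts, path))
    return visits


# Invented sample data: two browsers behind one address, plus a late return.
SAMPLE_REQUESTS = [
    ("192.0.2.10", "BrowserA", datetime(2023, 10, 10, 10, 0), "/index.html"),
    ("192.0.2.10", "BrowserB", datetime(2023, 10, 10, 10, 2), "/index.html"),
    ("192.0.2.10", "BrowserA", datetime(2023, 10, 10, 10, 5), "/products.html"),
    ("192.0.2.10", "BrowserA", datetime(2023, 10, 10, 11, 0), "/index.html"),
]

if __name__ == "__main__":
    for visit in group_into_visits(SAMPLE_REQUESTS):
        print(visit["agent"], [path for _, path in visit["pages"]])
```

Two users behind the same proxy with the same browser would still be merged into one visit, which is exactly the inaccuracy discussed in this section.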

Many analysis tools also report website "stickiness", either in terms of time spent reading the site or the number of pages read per visit. A visit is defined as a series of requests from the same computer/browser combination over a certain period. The time a user spends on a page is measured as the time between two page requests by the same user. There is no guarantee that the user actually spent this time reading the page. Overall time spent on the site is also inaccurate, due to caching and the fact that the time spent on the exit page cannot be known. Print media suffer from similar, if not more serious, deficiencies. A magazine publisher knows how many copies are printed and, by using an audit bureau, may even know how many are sold. However, determining how many people see each magazine, or which articles they actually read, can only be done by reader surveys. Keep these restrictions in mind, especially when listening to the inflated claims of salesmen.
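To make the time-on-page measurement concrete, here is a small sketch that takes the pages of one visit and computes the gap between consecutive requests. The function name and the sample visit are invented for the example; note that the exit page has no following request, so its duration is left unknown.

```python
from datetime import datetime


def time_per_page(pages):
    """Estimate seconds spent on each page of a single visit.

    pages is a list of (timestamp, path) in time order.
    Returns a list of (path, seconds); seconds is None for the exit page,
    because no later request tells us when the user left it.
    """
    durations = []
    for (ts, path), (next_ts, _) in zip(pages, pages[1:]):
        durations.append((path, (next_ts - ts).total_seconds()))
    if pages:
        durations.append((pages[-1][1], None))
    return durations


# Invented sample visit of three pages.
SAMPLE_VISIT = [
    (datetime(2023, 10, 10, 10, 0, 0), "/index.html"),
    (datetime(2023, 10, 10, 10, 2, 30), "/products.html"),
    (datetime(2023, 10, 10, 10, 3, 0), "/contact.html"),
]

if __name__ == "__main__":
    for path, seconds in time_per_page(SAMPLE_VISIT):
        print(path, seconds)
```

The figures are only gaps between requests; they say nothing about whether the user was actually reading the page during that time.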
