Common Problems with Search Engine Robots

An important thing to understand is that a page will not appear in a search engine's index until it has been visited by the search engine's robot. This is where a knowledge of log-file traffic analysis is very useful. Most robots identify themselves and can be found by looking at the website's log files. Real web hosting with access to raw logs is essential for SEO. Third-party traffic counters will not provide this information, as they rely on the browser downloading an image file or running some JavaScript, and robots don't do this.

66.196.90.36 [01/Oct/2004:00:01:04 +0100] "GET /index.htm HTTP/1.0" 200 6025 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

The robot identifies itself using the User-Agent HTTP header. In this case it is Yahoo's Slurp. It fetched the home page, index.htm, on the 1st of October. Two interesting pieces of information that should be checked are the HTTP response code and the number of bytes sent, here 200 and 6025. The code 200 means OK and the size is correct. Also note that there is no referrer information (the "-" field). Robots fetch pages based on their own list compiled by the search engine's indexing program, not by following links around the site in real time.
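
The fields discussed above can be pulled out of a log with standard command-line tools. The following is a minimal sketch, assuming a log laid out like the sample line, with the request, the referrer and the user agent each enclosed in double quotes. The function name robot_fields is made up for illustration.

```shell
# robot_fields: hypothetical helper, not part of any standard tool.
# Reads log lines (from the files given, or from standard input) and prints
# the status code, the bytes sent and the user agent of each request.
robot_fields() {
    # Splitting each line on the double quote gives:
    #   $2 = request, $3 = " status bytes ", $4 = referrer, $6 = user agent
    awk -F'"' '{ split($3, f, " "); print f[1], f[2], $6 }' "$@"
}
```

Run it as robot_fields mydomain.log. For the sample line it prints 200 and 6025 followed by the Slurp user agent string.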

The following table lists some of the most common search engine robots:

  User Agent        Search Engine
  Scooter           AltaVista
  Slurp             Yahoo!, Inktomi, MSN Search, HotBot, Lycos Europe
  Googlebot         Google
  msnbot            MSN Algorithmic Search
  Sidewinder etc.   Infoseek
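
To see which of these robots have actually called, the names in the table can be counted in the log file. This is a rough sketch only: count_robot_hits is a made-up helper name, and it simply counts the lines that mention each name anywhere, so it assumes the names do not turn up in ordinary URLs or referrers.

```shell
# count_robot_hits: hypothetical helper. $1 = log file to scan.
# Prints each robot name from the table with the number of log lines
# that mention it (matched case-insensitively anywhere in the line).
count_robot_hits() {
    for bot in Scooter Slurp Googlebot msnbot Sidewinder; do
        # grep -c prints the number of matching lines, -i ignores case
        printf '%s %s\n' "$bot" "$(grep -ci "$bot" "$1")"
    done
}
```

Run it as count_robot_hits mydomain.log. A robot with a count of zero over a month or more has probably not found the site.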

Robots have a poor understanding of JavaScript, HTML Frames and multimedia content such as Flash animations. Each request for a page stands on its own: robots can't manage cookies or session ids and don't provide referrer information. They are unable to access password-protected areas. This has the advantage that extraneous information such as member profiles in forums, effectively noise for a robot, can be hidden from any user not logged in to the site.

There are a number of reasons why a page doesn't appear in the search engine results pages:

  1. The page is too deep within the site's hierarchy or not correctly linked. Check internal links and consider adding a site map to allow every page to be accessed within two jumps from the home page.
  2. The web site was unreachable due to name server (DNS) or routing problems. Internet routing problems can be hard to track down; the page may be accessible from one location but not from the search engine.
  3. The web server was down. Normally search engines will retry a number of times before giving up or delisting a page. If the web hosting is unreliable it should be moved elsewhere.
  4. Dynamic pages are used. Dynamic pages are not in themselves a problem, but long query strings and dynamic URLs pose problems for search engine robots.
  5. The site uses HTML Frames. This is common with domain cloaking.
  6. The site is protected by a robots.txt file.
  7. The site has been page-jacked. At the time of writing this seems to be a feature of Yahoo's and Google's handling of redirects: the page is indexed as the result of a redirect from another page, but it is the redirecting page that appears in the results. Googlebot or Slurp will appear in the log files but the site can't be found in search results.
  8. The web server has trouble serving content to the robot. Robots are interested in content and typically accept HTML and plain text pages. They indicate this to the web server by sending the HTTP header Accept: text/html, text/plain. Some badly configured web servers reject this with an HTTP 406 (Not Acceptable) error. You can test this with the command-line URL (cURL) utility:

    curl -H "Accept: text/html"  http://mydomain.com/
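
A variation on the command above prints only the status code, which makes a 406 easy to spot. This is a sketch: accept_status is a made-up helper name, and http://mydomain.com/ is the same placeholder domain as above.

```shell
# accept_status: hypothetical helper. $1 = URL to test.
# Sends the Accept header a robot would send and prints only the
# HTTP status code of the reply.
accept_status() {
    # -s hides the progress meter, -o /dev/null discards the page body,
    # -w '%{http_code}\n' prints the numeric status code on its own line
    curl -s -o /dev/null -w '%{http_code}\n' \
         -H "Accept: text/html, text/plain" "$1"
}
```

Run it as accept_status http://mydomain.com/. A healthy server should answer 200, while the misconfiguration described in point 8 shows up as 406.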

There are web-based tools that aim to show you a search engine spider's view of your page:

http://searchengineworld.com/cgi-bin/sim_spider.cgi

As a rule, the simpler the pages, the more likely they are to get indexed by search engine robots. This is a case where flashy, corporate sites often lose out.

In order to work out which pages are not getting indexed and why, it is necessary to dig a little deeper. Log file analysis tools can give some general trends, but more specific problems may require a more detailed analysis of individual log entries. If your site is reasonably popular, searching log files directly can be a daunting business. On a site with 100,000 page views per month the log file may hold close to a million entries, because each page view also logs requests for its images and other files.

The first useful piece of information is how many pages are indexed by search engines. The command

    site:yoursite.com

works with MSN Search, Yahoo! and Google and returns a list of all the indexed pages. For a well-structured site that has been around for some time this figure should be similar to the total number of pages on the site, give or take any changes made over the last couple of months such as adding new content.

If there is a big difference, and there often is, take a look through the checklist above. If you are running Linux or Mac OS X, or have the Red Hat Cygwin toolset installed on your Windows PC, there are some useful text analysis tools that you can run from a command line window. Windows users can load the log file into Excel using the space character as a column separator. The only problem is that Excel (up to Excel 2003) will not load more than 65,536 lines.

First of all, find all the redirect and error lines for the search engine's robot over at least the previous month. This can be done with grep (a native Windows version of this command can be downloaded):

    grep -i " 30[0-9] .*googlebot" mydomain.log
    grep -i " 40[0-9] .*googlebot" mydomain.log

The first command searches for all the lines containing a space, the string "30", a single digit and another space, followed by any number of characters and finally the word "googlebot". The -i option ignores case, so "Googlebot" is matched as well. This is grep regular expression syntax and is a very powerful tool for pattern matching text in files. Further explanation is beyond the scope of this book. Teach Yourself Regular Expressions in 10 Minutes by Ben Forta explains the subject in more detail.
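
Before reading individual lines it can help to see how many requests from a robot ended in each status code. The following is a sketch under the assumption that the request and the user agent are enclosed in double quotes, as in the sample log line. The helper name status_summary is made up for illustration.

```shell
# status_summary: hypothetical helper.
# $1 = robot name to look for in the user agent, $2 = log file.
# Prints one line per HTTP status code: the code, then the number of requests.
status_summary() {
    awk -F'"' -v bot="$1" '
        # $6 is the user agent, $3 holds " status bytes "
        index(tolower($6), tolower(bot)) {
            split($3, f, " ")
            count[f[1]]++
        }
        END { for (code in count) print code, count[code] }
    ' "$2" | sort
}
```

Run it as status_summary googlebot mydomain.log. A large share of 30x or 40x codes is a hint that the grep commands above are worth running.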

Redirects (HTTP 30x responses) are not a problem if the robot subsequently fetches the redirected page. HTTP 40x errors may be due to missing resources or server configuration problems. Search engine robots are very slow at updating their list of URLs to fetch and may take many months to drop deleted or moved pages. This is reasonable, as resources are sometimes temporarily unavailable when the robot comes to call.

The following series of commands extracts all the requests by Googlebot from the log file, cuts the 7th space-separated column (the requested URL in the common combined log format) from each line, then sorts the list and removes any duplicates:

    grep -i googlebot mydomain.log | cut -d' ' -f7 | sort | uniq

The resulting list can be compared with the files on the website to find out which resources are not being indexed by the search engine. Use the checklist to determine the nature of the problem. If the content is recent then the robot may not yet have the resource on its list of pages.
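
The comparison itself can also be done from the command line. The following is a sketch only, assuming a local copy of the site's files whose directory layout mirrors the URLs. The helper name unfetched_pages and its two arguments are made up for illustration. URLs with query strings and requests for a bare directory such as "/" will not match a file name and need checking by hand.

```shell
# unfetched_pages: hypothetical helper.
# $1 = log file, $2 = directory holding a local copy of the web site.
# Prints the files on the site that Googlebot has not requested.
unfetched_pages() {
    log=$1
    root=$2
    fetched=$(mktemp)
    onsite=$(mktemp)
    # URLs requested by the robot: the second word of the quoted request
    grep -i googlebot "$log" |
        awk -F'"' '{ split($2, r, " "); print r[2] }' | sort -u > "$fetched"
    # Files on the site, written as URL paths relative to the web root
    (cd "$root" && find . -type f | sed 's|^\.||') | sort -u > "$onsite"
    # comm -13 prints the lines that appear only in the second file
    comm -13 "$fetched" "$onsite"
    rm -f "$fetched" "$onsite"
}
```

Run it as unfetched_pages mydomain.log followed by the path to the local copy of the site.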
