Common Problems with Search Engine Robots

A page will not appear in a search engine's index until it has been visited by the search engine's robot. This is where a knowledge of log-file traffic analysis is very useful. Most robots identify themselves and can be found by looking at the website's log files. Real web hosting with access to raw logs is essential for SEO. Third-party traffic counters will not provide this information because they rely on the browser downloading an image file or running some JavaScript, and robots do neither.
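As a rough illustration of finding robots in a raw log, the sketch below counts requests per user agent. It assumes the server writes the common Apache "combined" log format, in which the user agent is the last quoted field on each line. The file name sample.log and both log lines are made up for this example and are not taken from a real site.

```shell
# Hypothetical two-line log in Apache "combined" format (invented data).
cat > sample.log <<'EOF'
192.0.2.10 - - [01/Oct/2005:06:15:02 +0000] "GET /index.htm HTTP/1.0" 200 5120 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
192.0.2.77 - - [01/Oct/2005:09:41:30 +0000] "GET /about.htm HTTP/1.1" 200 2048 "http://www.example.com/index.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
EOF

# Splitting each line on double quotes puts the user agent in field 6.
# Count the requests made by each agent, busiest first.
agents=$(awk -F'"' '{print $6}' sample.log | sort | uniq -c | sort -rn)
echo "$agents"
```

On a real log the robots usually stand out in this list because their user agent strings name the search engine, whereas ordinary visitors show browser names.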
The robot identifies itself using the User-Agent HTTP header. In this case it is Yahoo's Slurp, which fetched the home page, index.htm, on the 1st of October. Two pieces of information worth checking are the HTTP response code and the number of bytes sent. These are shown in bold: 200 means OK, and the size is correct. Also note that there is no referrer information. Robots fetch pages from their own list, compiled by the search engine's indexing program, not by following links around the site in real time. The following table lists some of the most common search engine robots:
Robots have a poor understanding of JavaScript, HTML frames and multimedia content such as Flash animations. Each request for a page stands on its own: robots can't manage cookies or session IDs, and they don't provide referrer information. They are also unable to access password-protected areas. This has the advantage that extraneous information such as member profiles in forums, effectively noise for a robot, can be hidden from any user not logged in to the site.

There are a number of reasons why a page doesn't appear in the search engine results pages:
There are web-based tools that aim to show you a search engine spider's view of your page:

As a rule, the simpler the pages, the more likely they are to get indexed by search engine robots. This is a case where flashy, corporate sites often lose out. To work out which pages are not getting indexed, and why, it is necessary to dig a little deeper. Log file analysis tools can show general trends, but more specific problems may require a detailed analysis of individual log entries. If your site is reasonably popular, searching log files directly can be a daunting business. On a site with 100,000 page views per month the log file may hold close to a million entries.

The first useful piece of information is how many pages are indexed by search engines. The command site:yoursite.com works with MSN Search, Yahoo! and Google and will return a list of all the indexed pages. For a well-structured site that has been around for some time this figure should be similar to the total number of pages on the site, give or take any changes made over the last couple of months such as adding new content. If there is a big difference, and there often is, take a look through the checklist above.

If you are running Linux or Mac OS X, or have the Red Hat Cygwin toolset installed on your Windows PC, there are some useful text analysis tools that you can run from a command line window. Windows users can load the log file into Excel using the space character as a column separator. The only problem is that Excel limits the number of rows it will load (65,536 in versions before Excel 2007, 1,048,576 since).

First of all, find all the redirect and error lines for the search engine's robot over at least the previous month. This can be done with grep (a native Windows version of this command can be downloaded):
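The command listing itself is missing at this point. Going by the description that follows it, the commands were probably close to the two below; treat them as a reconstruction, not the author's exact text. The file name mydomain.log is borrowed from the later example, the -i option is an addition, and the sample log lines are invented so that the commands can be run as they stand.

```shell
# Invented sample data in Apache "combined" format; mydomain.log is the
# file name used in the later example.
cat > mydomain.log <<'EOF'
192.0.2.1 - - [03/Oct/2005:04:10:11 +0000] "GET /old.htm HTTP/1.1" 301 245 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.1 - - [03/Oct/2005:04:10:15 +0000] "GET /gone.htm HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.1 - - [03/Oct/2005:04:10:19 +0000] "GET /index.htm HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# Redirects: " 30", any one character, any run of characters, "googlebot".
# -i (an addition here) ignores case so that "Googlebot" also matches.
redirects=$(grep -i " 30..*googlebot" mydomain.log)

# Client errors: the same pattern with " 40" in place of " 30".
errors=$(grep -i " 40..*googlebot" mydomain.log)

echo "$redirects"
echo "$errors"
```

Because the pattern only looks for " 30" or " 40" somewhere before the robot's name, it can also catch a byte count that happens to start with those digits, so the output is worth a quick visual check.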
The first command says: search for all the lines containing the string " 30", followed by a single character, followed by any number of characters, then finally the word "googlebot". This is grep regular expression syntax, a very powerful tool for matching patterns of text in files. Further explanation is beyond the scope of this book. Teach Yourself Regular Expressions in 10 Minutes by Ben Forta explains the subject in more detail.

Redirects (HTTP 30x responses) are not a problem if the robot subsequently fetches the redirected page. HTTP 40x errors may be due to missing resources or server configuration problems. Search engine robots are very slow at updating their list of URLs to fetch and may take many months to drop deleted or moved pages. This is reasonable, as resources are sometimes temporarily unavailable when they come to call.

The following series of commands extracts all the requests by googlebot from the log file, cuts the 7th column (the requested path) from each line, sorts the result and removes any duplicates:

    grep -i googlebot mydomain.log | cut -d' ' -f7 | sort | uniq

The -i option makes the match case-insensitive, so user agents that spell the name "Googlebot" are also found. The resulting list can be compared with the files on the website to find out which resources are not being indexed by the search engine. Use the checklist to determine the nature of the problem. If the content is recent then the robot may not yet have the resource on its list of pages.
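One way to make that comparison is sketched below, under some assumptions: the site's pages are available as a plain list with one URL path per line (the file name site_pages.txt is hypothetical), and the log uses the same layout as the command above. The log lines and the page list are invented for the example.

```shell
# Invented sample log; mydomain.log matches the file name used above.
cat > mydomain.log <<'EOF'
192.0.2.1 - - [03/Oct/2005:04:10:19 +0000] "GET /index.htm HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.1 - - [03/Oct/2005:04:11:02 +0000] "GET /about.htm HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.1 - - [04/Oct/2005:05:02:40 +0000] "GET /index.htm HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# Hypothetical list of every page on the site, one URL path per line.
printf '%s\n' /about.htm /contact.htm /index.htm /products.htm > site_pages.txt

# Paths the robot requested, de-duplicated. comm needs sorted input.
# -i is an addition so that "Googlebot" matches regardless of case.
grep -i googlebot mydomain.log | cut -d' ' -f7 | sort -u > fetched.txt
sort -u site_pages.txt > all_pages.txt

# -23 hides lines unique to the second file and lines common to both,
# leaving the pages that the robot never fetched.
missing=$(comm -23 all_pages.txt fetched.txt)
echo "$missing"
```

With this sample data the pages left over are /contact.htm and /products.htm, which are the ones to check against the checklist.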
©1994-2006 All text and images copyright: www.abcseo.com; last updated: