Robots and Spiders

Search engines run programs called robots that automatically traverse a website's link structure and retrieve documents. Retrieving pages from a website is often called spidering or crawling, so robots are also referred to as spiders or web crawlers. A search engine runs many such robots simultaneously. A robot finds a website in one of three ways: through a URL submitted directly to the engine by the webmaster, through a user visiting the site with one of the search engine's toolbars installed in their browser, or by following a link from another site.

The robot will not necessarily enter through the site's home page. This implies that every page of the site should be reachable from whichever page serves as the entry page. The easiest way to achieve this is to put a link to the home page on each page within the site, since the rest of the site can normally be reached from there.
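The reachability requirement can be checked mechanically. The following is a minimal sketch, assuming the site's link structure is already available as a dictionary mapping each page to the pages it links to; the `site` data and the function names are hypothetical and chosen only for illustration.

```python
from collections import deque


def reachable_from(start, links):
    """Return the set of pages reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def unreachable_pages(entry, links):
    """Return the pages a robot entering at `entry` could never find."""
    all_pages = set(links)
    for targets in links.values():
        all_pages.update(targets)
    return all_pages - reachable_from(entry, links)


# Hypothetical site: the home page leads to everything,
# but the inner pages do not link back.
site = {
    "/": ["/about", "/products"],
    "/about": [],
    "/products": ["/products/widget"],
    "/products/widget": [],
}

# The same site with a link to the home page added to every page.
site_with_home_links = {page: targets + ["/"] for page, targets in site.items()}
```

In this sketch a robot entering at `/about` finds nothing else on the original `site`, whereas on `site_with_home_links` every page is reachable from every entry point.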

From the point of view of the web server, the robot works in much the same way as a browser: it requests a page and downloads the content. Any URLs in the page are extracted and added to the list of pages to be indexed. The page is then passed to an indexer program, which extracts the content of the page, including elements such as titles, headings and paragraphs, and indexes them in accordance with the search engine's algorithms. Other rules, such as keyword proximity, may also be applied. Search engines use increasingly sophisticated algorithms, which may also consider groups of pages in order to identify themes.
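The link extraction step described above can be illustrated with a short sketch using only the Python standard library. It is not how any particular search engine works; the class and function names are hypothetical, and the page is assumed to have been downloaded already. It shows how the URLs in a page might be resolved against the page's own address and added to a list of pages still to be visited.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    """Return the links found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def enqueue_new(links, frontier, seen):
    """Add links that have not been seen before to the crawl frontier."""
    for url in links:
        if url not in seen:
            seen.add(url)
            frontier.append(url)
```

A real robot would also fetch each queued URL, respect the site's robots.txt file and hand the page text to the indexer; those steps are left out to keep the sketch short.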

If the page is large, the robot may retrieve only the first part of the page and follow only the first few links. Google currently indexes around 100 kilobytes of a file (text and markup); Yahoo! will index somewhat more. It is possible to test this by finding a long file (word lists, glossaries and dictionaries are a good source) and then querying terms that occur towards the bottom of the page. For example, the following query in Google:

"The Climbing Dictionary" abseil adze filetype:htm

returns a 163 kilobyte glossary of climbing terms, whereas

"The Climbing Dictionary" abseil adze portaledge filetype:htm

returns a different web page, because the term portaledge occurs at around the 110 kilobyte point in the page and so was not indexed. Interestingly, the cached version shows the complete file but claims it is only 101 kilobytes long. Yahoo!, on the other hand, has no trouble with the second query and indexed the whole document. The Googlebot also takes a number of visits to follow a large number of outbound links (50+) on a single page. The bottom line is that web authors should try to keep pages fairly short, with the most important content and links towards the top of the page. Dividing content into shorter pages has a number of advantages:
