Robot Wrangling
Well-behaved search engine robots check whether they are welcome before crawling all over a site. On their first visit they look for a file called robots.txt in the site's home (root) directory. After fetching a page, robots may also look at META tags in the page's header. Together these set out ground rules a robot should obey.
Robots.txt consists of one or more records separated by blank lines. Each record lists a robot's User-agent string and the resources that robot is not allowed to access.
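As a rough illustration of the record format, the sketch below embeds a small robots.txt in a Python string and checks it with the standard library's `urllib.robotparser`. The robot names `ExampleBot` and `OtherBot` and the paths are made up for the example, and this parser's matching rules are only an approximation of how any particular search engine robot behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two records separated by a blank line.
ROBOTS_TXT = """\
# Keep one named robot out of the whole site
User-agent: ExampleBot
Disallow: /

# All other robots: stay out of temporary files and admin scripts
User-agent: *
Disallow: /tmp/
Disallow: /admin/
"""


def build_parser(text):
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)

# ExampleBot is excluded everywhere by its own record.
blocked_everywhere = rules.can_fetch("ExampleBot", "/index.html")
# Any other robot falls through to the '*' record.
other_home = rules.can_fetch("OtherBot", "/index.html")
other_admin = rules.can_fetch("OtherBot", "/admin/login.cgi")

print(blocked_everywhere, other_home, other_admin)
```

Checking your rules this way before uploading the file is a cheap guard against accidentally disallowing the whole site.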
Comments begin with a hash ('#') character. The wildcard '*' specifies all robots. You may wish to exclude robots from temporary files or administration scripts that you don't want to be found easily. Do not rely on this for security: robots.txt is publicly readable, and obeying it is voluntary.
If you don't have access to the home directory of your server, you can use robots META tags instead. Interpretation can differ slightly between robots.
If you only want to instruct Google's robot, replace ROBOTS in the tag's name with googlebot. In the tag's content you can combine terms such as index, noindex, follow and nofollow, separated by commas.
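To make the META tag form concrete, here is a minimal sketch that embeds a hypothetical page header in a Python string and reads the robot directives back out with the standard library's `html.parser`. The page content, the class name `RobotsMetaParser` and the helper `read_robots_meta` are assumptions made for this example; `noarchive` is shown as one Google-specific term, and other robots may ignore it.

```python
from html.parser import HTMLParser

# Hypothetical page: one tag for all robots, one only for Google's robot.
PAGE = """<html><head>
<title>Example page</title>
<meta name="robots" content="noindex, follow">
<meta name="googlebot" content="noarchive">
</head><body>Hello</body></html>"""


class RobotsMetaParser(HTMLParser):
    """Collect the content terms of robots-style META tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attr.get("content") or ""
            # Terms are comma separated; case and spacing are not significant.
            terms = [t.strip().lower() for t in content.split(",") if t.strip()]
            self.directives[name] = terms


def read_robots_meta(html):
    """Return a dict mapping META name to its list of directive terms."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


directives = read_robots_meta(PAGE)
print(directives)
```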
Google Image Search
Some websites don't want their images indexed directly by Google. Recently Perfect 10 magazine went as far as trying to sue Google for copyright infringement, on the grounds that Google reproduces pictures of the scantily clad young ladies featured in Perfect 10 in its results. Google uses a separate robot for image search, Googlebot-Image; a robots.txt entry that disallows that user agent will keep the image crawler off your site.
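A minimal sketch of such an entry, again checked with Python's `urllib.robotparser` rather than with Google's own crawler, might look like this. The user agent name `Googlebot-Image` is the one Google documents for its image robot; the image path is invented for the example.

```python
from urllib.robotparser import RobotFileParser

# robots.txt record that excludes only the image crawler from the whole site.
IMAGE_RULES = """\
User-agent: Googlebot-Image
Disallow: /
"""

image_parser = RobotFileParser()
image_parser.parse(IMAGE_RULES.splitlines())

# The image robot is shut out, the ordinary web robot is unaffected.
image_allowed = image_parser.can_fetch("Googlebot-Image", "/photos/cover.jpg")
web_allowed = image_parser.can_fetch("Googlebot", "/photos/cover.jpg")

print(image_allowed, web_allowed)
```

Because the record names only the image robot, ordinary web search indexing of the same pages is left alone.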
Google's URL Controller
The URL controller <http://services.google.com/urlconsole/controller> lets you remove a dead URL (one that returns a server '404' error) from Google's index. You will need to register for the service. This is useful if you move material and the Google robot keeps trying to spider the old URL. There are also reports that it works with '302' redirected pages, which page jackers can use to steal PageRank.
©1994-2006 All text and images copyright: www.abcseo.com