Bad Robots and Spiders

Not all robots are good. Some roam the web looking for security holes, probing for known weaknesses in web servers and content management systems. Gateways for sending email are particularly sought after because spammers can use them to hide the origin of their junk email. Other robots, normally referred to as spambots, harvest email addresses from web pages to add to spam lists. EmailSiphon and Cherry Picker are examples.

Some robots may appear to be more benign. A site I manage was recently visited by a robot identifying itself as:

    NPBot

It belongs to Name Protect, a company that searches the web on behalf of its clients for intellectual property infringements. There is nothing much wrong with that, except that the robot consumes your server's resources and its results will not bring you more visitors, so there is no advantage in letting it visit.
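For a robot that does honour robots.txt, an exclusion entry is the lightest way to turn it away. The sketch below uses Python's standard `urllib.robotparser` to check that a robots.txt entry has the intended effect before it is published. The robots.txt content, the `is_allowed` helper and the example.com URLs are illustrative assumptions. Only the NPBot name comes from the visit described above, and whether NPBot actually respects such an entry is not something this check can tell you.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: shut out NPBot, leave every other robot alone.
ROBOTS_TXT = """\
User-agent: NPBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host, not a site from the text.
    print(is_allowed("NPBot", "http://www.example.com/index.html"))
    print(is_allowed("Googlebot", "http://www.example.com/index.html"))
```

Running the same check against a few other robot names is a quick way to make sure a new entry has not shut out the search engines you do want.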

Some bad robots don't obey the robots.txt file. In that case the robot can be banned by its IP address or range of addresses. This can be done through the web server's administration utility or, in the case of the venerable Apache web server, directly in the .htaccess file:

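# Apache 2.2-style access rules; Apache 2.4 needs mod_access_compat for these directives
<Limit GET>
order allow,deny
allow from all
# Partial address without a wildcard: covers 63.241.61.0 to 63.241.61.255
deny from 63.241.61
</Limit>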

This should be done judiciously: you may block some real users, and it puts extra load on the web server, which now has to check the client's IP address on each request.
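Before deploying a ban like this, it may help to see how many of your recent visitors it would catch. The following is a minimal sketch using Python's standard `ipaddress` module. It assumes the banned range is the 256 addresses beginning 63.241.61, written here in CIDR form, and the sample client addresses are made up for illustration. `is_blocked` and `count_blocked` are hypothetical helper names.

```python
import ipaddress

# The banned range from the .htaccess example, in CIDR form
# (assumed to mean 63.241.61.0 through 63.241.61.255).
BLOCKED = [ipaddress.ip_network("63.241.61.0/24")]


def is_blocked(client_ip: str, networks=BLOCKED) -> bool:
    """Return True if client_ip falls inside any of the banned networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


def count_blocked(client_ips, networks=BLOCKED) -> int:
    """Count how many of the given client addresses the ban would reject."""
    return sum(1 for ip in client_ips if is_blocked(ip, networks))


if __name__ == "__main__":
    # Illustrative addresses only; in practice these would come from your access log.
    sample = ["63.241.61.7", "63.241.62.7", "192.0.2.10"]
    for ip in sample:
        print(ip, is_blocked(ip))
    print("would block", count_blocked(sample), "of", len(sample))
```

If the count turns out higher than the robot's own traffic, the range is probably too wide and is catching real users as well.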

Robots Exclusion Standard

The Robots Exclusion Standard gives more information about the robots.txt syntax.

http://www.robotstxt.org/wc/norobots.html
