Texas Holdem or Stopping Comment and Referrer Spam

Many people have been reporting a veritable tidal wave of referrer spam recently. Every time someone accesses a page on a web server, a line is written to a log file specifying, amongst other things, the resource that was accessed, the user's IP address, the time, and the resource the user was looking at immediately prior to visiting the site. This last piece of information is called the referrer and is useful for, among other things, checking the value of inbound-links.

Here is an example:

151.142.207.11 - 151.142.207.11.217581107936817828 [09/Feb/2005:08:13:38 +0000] "GET /archives/a2003061/ HTTP/1.0" 200 6931 "http://poker.yelucie.com/" "Mozilla/4.0 (compatible; Lotus-Notes/5.0; Windows-NT)"
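To make the fields concrete, here is a minimal Python sketch that pulls the referrer out of the line above. The regular expression and the names LOG_LINE and parse_line are my own illustration, and the sketch assumes the log uses the same layout as this example; other servers may be configured differently.

```python
import re

# Pattern for a log line laid out like the example above (an assumption:
# other servers may log fields in a different order).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the named fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


sample = (
    '151.142.207.11 - 151.142.207.11.217581107936817828 '
    '[09/Feb/2005:08:13:38 +0000] "GET /archives/a2003061/ HTTP/1.0" 200 6931 '
    '"http://poker.yelucie.com/" '
    '"Mozilla/4.0 (compatible; Lotus-Notes/5.0; Windows-NT)"'
)

entry = parse_line(sample)
print(entry["ip"], "->", entry["referrer"])
```

The referrer is the second-to-last quoted field, which is all a spammer needs to forge.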

Referrer spam relies on webmasters leaving their server logs open to the public or blog owners showing the top referrers in their site statistics. This means that the referrer URLs may get indexed by search engines and provide a source of inbound-links for the spam site. Inbound-links are a component in the ranking processes of the Google, Yahoo! and MSN algorithmic search engines. Some logs have surprisingly high page ranks so there is a value to the links.

At one time a spammer would research the target weblogs to find those that were both indexed by search engines and had a high Google PageRank. He would then launch a script, requesting the smallest page with his website as the referrer URL. A number of requests were necessary for each target as many of the logs only displayed the top 25 referrers.

Recently things have moved up a gear. In January 2005 many webmasters began to complain of a large amount of referrer spam even though, in many cases, their weblogs were password protected. It seems that the spammer was taking a scatter gun approach, assuming that at least some of the targets would yield results.

What is Wrong with Referrer Spamming?

Referrer spamming consumes server resources which are paid for by the website owner or ISP. It slows down access to the site for normal surfers. It uses disk space. It clogs weblogs with bogus information. It subverts the ranking mechanisms used by search engines. Spammers are nothing more than leeches sucking the lifeblood out of the web.

Isn't Spam really Google's fault?

It seems that Heisenberg was right. By observing something you affect the observations. Google's PhDs, for all their brains, are a somewhat innocent bunch; unable to see the consequences of their actions or understand how the real world operates. By basing their search engine rankings on inbound-links and anchor text they encourage unscrupulous people to exploit weaknesses in the system to boost their websites to the top of Google's rankings.

Google is a great resource for finding information. However the Webosphere didn't ask Google to set up shop. Google is a business. Like the spammers they are in it for the money. Now I'm not saying Google is evil but they need to mature as a business. Instead of focussing on propeller-heads who've probably never had a girlfriend they need to employ some guys with street smarts who can think through the latest whizzy idea before it gets beta'd on the rest of us. Other large businesses have to take some responsibility for their actions (well ok not Microsoft, they have the EULA), so why not Google?

Still there are differences between setting up the environment that encourages spam and actually generating the damn stuff. But if we don't act, search engine spam will harm the web just as surely as UCE has harmed email. We don't want to reach the stage where 85% of all requests to a website are spam, do we?

There are other actors. Microsoft must shoulder a lot of blame for selling a completely insecure operating system in the form of Windows. So must the ISPs and Web hosting companies that support the spammers.

Clearing up the Mess

What can be done to clear up the mess of spam in your weblogs? A number of technical solutions have been suggested based around the Apache web server's mod_rewrite module. The problem, as we will see, is that the spammers don't follow well defined patterns. This can make stopping spam a game of cat and mouse.

The first problem is that filtering spam takes computer resources. As the set of spammers to be filtered grows so does the load on the server. The mod_rewrite module is heavy on the processor. It may simply be better to let the spammer through. An alternative is mod_security; we will take a look at this after analysing a log file.

What's it all about... Alfie?

I wondered where all this spam led. First of all I did not click on the link in my log file; this would send my referrer whizzing over to the spam site and might only encourage more of the stuff. Instead I cut and pasted the domain names. I wanted to know who was behind the spam so I went to samspade.org and typed in the domains:

ronnieazza.com

Samspade told me the Registrant was Susan Lee, living in New York and the administrative contact Evelin Porter. I checked the names in the US white pages and these people don't exist at the addresses given. The IP address of the server is 219.150.118.16.

yelucie.com

Yelucie was hosted on the same IP address. Again the contact, Harry Graham, was bogus. I checked the numbers with a reverse telephone number database and they were not assigned.

6q.org, smsportali.net, future-2000.net

6q.org, smsportali.net and future-2000.net were also on the same IP.

China Telecom

So where was this Web server? Checking the IP address 219.150.118.16 showed that the machine was hosted by Chinanet Henan Province. Chinanet and Chinatelecom are notorious spam hosts and should really be booted off the Internet.

Analysis of Keywords

I downloaded my log files for the last week. These were common words used in the referrer fields:

buy, phentermine, xanax, diet-pills, generic-viagra, loan, loans, poker, reductil, soma, tramadol, valium, casino, texas-hold-em, blackjack, levitra, games, cialis, credit, watches, craps, online, roulette, slot-machines, prozac, carisoprodol, meridia, payday, mortgage

The spam seems centred around certain drugs, poker, loans and erm... texas-hold-em, whatever that might be?
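A keyword tally like this can be scripted rather than done by eye. The following is a minimal Python sketch, not the method I actually used: keyword_counts is a hypothetical helper, the keyword list is a short subset of the one above, and the referrers are the three that appear elsewhere in this article.

```python
from collections import Counter


def keyword_counts(referrers, keywords):
    """Count how many referrer URLs contain each keyword (case-insensitive substring match)."""
    counts = Counter()
    for referrer in referrers:
        text = referrer.lower()
        for word in keywords:
            if word in text:
                counts[word] += 1
    return counts


# Referrers taken from the log entries and test command quoted in this article.
referrers = [
    "http://poker.yelucie.com/",
    "http://www.hold-em-action.com/blah.htm",
    "http://www.bestfreedirectory.com/",
]
keywords = ["poker", "hold-em", "loan", "casino"]

counts = keyword_counts(referrers, keywords)
for word, hits in counts.most_common():
    print(word, hits)
```

Substring matching is crude (a keyword such as "loan" also matches "loans"), but it is the same kind of test the ModSecurity filter later in the article performs.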

Analysis of IP Addresses of Spammers

I looked at the IP addresses being used by the spammers. For brevity I won't include them all here. Just to say that well over 50 machines were involved in sending the referrer spam in a single 7 day period. These machines were spread all over the world. From this I conclude that the spam is being sent by compromised "zombie" hosts in much the same way as a lot of email spam. The machines have become infected by a virus or worm which has installed a spam server. This is probably sent a list of sites to spam.
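Counting the machines involved is easy to automate. This is a rough Python sketch under my own assumptions: spam_clients and the SPAM_WORDS pattern are illustrative names, the referrer is taken to be the second quoted field on each line, and the two sample lines are the log entries quoted in this article.

```python
import re

# A few of the keywords listed above; the choice is illustrative only.
SPAM_WORDS = re.compile(r"poker|casino|loan|phentermine|hold-em")


def spam_clients(log_lines):
    """Return the set of client IPs whose referrer field contains a spam keyword."""
    clients = set()
    for line in log_lines:
        quoted = re.findall(r'"([^"]*)"', line)  # request, referrer, user agent
        if len(quoted) < 3:
            continue  # not in the expected layout, skip it
        referrer = quoted[1].lower()
        if SPAM_WORDS.search(referrer):
            clients.add(line.split()[0])  # the first field is the client IP
    return clients


log_lines = [
    '151.142.207.11 - 151.142.207.11.217581107936817828 [09/Feb/2005:08:13:38 +0000] '
    '"GET /archives/a2003061/ HTTP/1.0" 200 6931 "http://poker.yelucie.com/" '
    '"Mozilla/4.0 (compatible; Lotus-Notes/5.0; Windows-NT)"',
    '67.28.112.78 - 67.28.112.78.273701121719836309 [18/Jul/2005:21:50:36 +0100] '
    '"GET / HTTP/1.1" 200 9830 "http://www.bestfreedirectory.com/" '
    '"Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)"',
]

print(len(spam_clients(log_lines)), "distinct spamming hosts")
```

Note that the second entry slips through because its referrer contains none of the keywords, which is exactly the weakness of keyword matching.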

The Spam Sites

Many people who have been hit by the latest wave of referrer spam have gone to check out the spammers and found a message (usually in poor English) saying the spammer has been reported and the site is closing. The site offers you a form to report the spam where you can enter your URL and email. After nearly 20 years doing Internet stuff I was naturally suspicious so I checked back on the sites a few weeks later. Well blow me down, they were now up and running and selling watches, drugs and gambling.

It seems this spammer is pretty savvy. By initially putting up a page that made his site look like it was closed he probably hoped to avoid any trouble. He would also pick up some useful emails and URLs from his victims if they used his form.

Stopping Referrer Spam with ModSecurity

We mentioned mod_rewrite earlier but a better way of tackling referrer and blog spam is the mod_security package. ModSecurity is an open source Web Application Firewall for the popular Apache web server (over 70% of web sites use Apache according to netcraft.co.uk) and for Java based web servers. This is basically an intrusion detection and prevention engine that can inspect requests for suspicious activity and perform some action... such as rejecting them. It is more flexible and faster than using mod_rewrite.

The first thing you may want to know is if ModSecurity has been installed by your web host provider. You can do this using a tiny PHP program:

Create a file called info.php:

<?php phpinfo(); ?>

Upload it to your web host and run it from a browser; it should tell you which modules have been installed.

Configuring ModSecurity to Stop Referrer Spam

You can configure ModSecurity through your .htaccess file. Note the filename starts with a dot; this is a hidden file on Unix. If you already have an .htaccess file make sure you save a copy first, as changes can stop your website running.

The first thing we want to do is turn ModSecurity on with this line

SecFilterEngine On

ModSecurity will now check all requests to the site.

You now need to take some action. Currently most people recommend returning an HTTP 412 Precondition Failed message:

SecFilterDefaultAction "deny,status:412"

This tells the client that there was something about their request that was rejected by the Server.

Finally we want something to filter on. This requires some analysis of the log file. The big problem is the spammer is using different sites each day and the spam is coming from a veritable army of zombies so we can't filter either on his domain name or the client addresses. Well, not unless we want to spend our time updating our filter and slow our website to a crawl. The spammer's modus operandi seems to be inbound-links, click through traffic and good anchor text. As such his spam is keyword rich. I selected a set of the most commonly used keywords to filter on. I only need to check the HTTP_REFERER section of the HTTP header for the spam words we listed above.

SecFilterSelective "HTTP_REFERER" "(holdem|poker|loan|mortgage|hold-em)"

Try to keep this list reasonably short as ModSecurity has to check each term. This blocked about 95% of my spam.

Obviously as spammers get wise to ModSecurity they will probably adapt these keywords. We may soon have to add Bayesian filters to our spam checkers.

Testing the Referrer Spam Block

The most flexible tool for testing ModSecurity filters is the command line request tool cURL. This is like a super flexible Web browser that runs from a command window. It is available for Windows, Mac and Unix operating systems.

Here we run curl, the "-e" flag lets us specify the referrer:

$ curl -e "http://www.hold-em-action.com/blah.htm" http://www.mysite.com

This returns our 412 error message:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>412 Precondition Failed</TITLE>
</HEAD><BODY>
<H1>Precondition Failed</H1>
The precondition on the request for the URL / evaluated to false.<P>
</BODY></HTML>

Also take a tour around your website to make sure everything works properly. Remember that if you have any pages called "poker" or "loans" then any links will also be filtered.
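One way to spot such collisions before visitors do is to run your own URLs through the same pattern. Here is a minimal Python sketch; Python's re module stands in for ModSecurity's own matching, so edge cases (case sensitivity in particular) may differ, and the page paths under www.mysite.com are hypothetical.

```python
import re

# The same alternation used in the SecFilterSelective rule above.
SPAM_PATTERN = re.compile(r"(holdem|poker|loan|mortgage|hold-em)")


def would_be_blocked(referrer):
    """True if the referrer contains any of the filtered keywords."""
    return SPAM_PATTERN.search(referrer) is not None


# Hypothetical pages on our own site; a visitor following a link from one of
# these to another page sends that URL as the referrer.
own_pages = [
    "http://www.mysite.com/",
    "http://www.mysite.com/archives/poker-night/",
    "http://www.mysite.com/loans.htm",
]

collisions = [page for page in own_pages if would_be_blocked(page)]
for page in collisions:
    print("links from this page would be rejected:", page)
```

Any page that shows up in the list needs either renaming or a narrower keyword list.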

Configuring ModSecurity to Stop Comment Spam

You can also stop blog comment spam with ModSecurity. As comment pages are dynamic (written in PHP, Perl, Python or another Web programming language), you may only want to check requests to dynamically generated pages, such as blog comments:

SecFilterEngine DynamicOnly

This will save resources on your web server. You will also want to scan data sent to your server using the HTTP POST method. This is normally used for comment form submission:

SecFilterScanPOST On

Finally add a filter.

SecFilter "(holdem|poker|loan|mortgage|hold-em)"

There may be simpler means to stop comment spam. I noticed that a lot of the spam my site received was automated. It seemed that a script was running that created a user then attempted to post comments to the first 10 entries in my blog. The script relied on the fact that I was running a particular brand of weblog software. The first thing I did was remove any information about the blog software that was being used. This included changing the names of common scripts. I then moved all of the scripts into a new directory structure. You should back up your site before making any changes, as this required a number of modifications to code and administration files to work properly, but it did stop the spam.

Referrer Karma

Referrer Karma is another solution to referrer spam. The advantage over ModSecurity is that it is targeted at this one problem. It is aimed at Content Management Systems (blogs, forums etc) written in the PHP programming language and with a MySQL database available.

Referrer Karma works by applying certain heuristics to information in the HTTP referrer field and rejecting requests that match. These rules of thumb are applied in the following order:

  1. If there is no referrer information or the referrer is the same as the site domain the request is passed.
  2. If the referrer's domain is held in the file whitelist.txt the request is passed. For example if the whitelist contains the domain blogspot.com and the referrer is viagra.blogspot.com the request is passed. Ref-Karma comes with a long whitelist file and it may be necessary to delete some entries.
  3. If the full referrer is matched by some element in whitewords.txt it is passed. For example if the whitewords.txt file contains the term /wp-admin/ then the referrer http://texas-holdem.com/wp-admin/ will be passed.
  4. If the referrer's IP address matches a banned value then the request is blocked.
  5. If the referrer's domain matches a white entry in the database table it is passed.
  6. If the referrer's domain matches a black entry then the request is blocked, plus the referrer's IP address is banned after a certain number of attempts.
  7. Finally if none of the above match then the page referenced in the referrer URL is parsed to see if there is an inbound-link. If there is, the referrer is added as a white entry in the database table; otherwise it is added as a black entry.
  8. If the referrer's domain is unreachable or is not a valid URL the request is blocked.
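The order of the rules matters, so it may help to see them as code. The following Python sketch is my own reading of the list above, not Referrer Karma's actual PHP: KarmaState, check_referrer, the ban threshold of 3 and the in-memory sets standing in for the text files and database table are all hypothetical, and I have assumed that the banned IP in rule 4 is the address of the requesting client.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class KarmaState:
    whitelist: set = field(default_factory=set)    # stands in for whitelist.txt
    whitewords: set = field(default_factory=set)   # stands in for whitewords.txt
    db_white: set = field(default_factory=set)     # learned good domains
    db_black: dict = field(default_factory=dict)   # learned bad domain -> blocked attempts
    banned_ips: set = field(default_factory=set)
    ban_after: int = 3                             # hypothetical threshold


def domain_matches(domain, entries):
    """True if domain equals an entry or is a subdomain of one."""
    return any(domain == e or domain.endswith("." + e) for e in entries)


def check_referrer(referrer, client_ip, site_domain, state, fetch_page):
    """Return 'pass' or 'block'. fetch_page(url) returns page text, or None if unreachable."""
    domain = (urlparse(referrer).hostname or "") if referrer else ""
    if not referrer or domain == site_domain:
        return "pass"                                   # rule 1
    if domain_matches(domain, state.whitelist):
        return "pass"                                   # rule 2
    if any(word in referrer for word in state.whitewords):
        return "pass"                                   # rule 3
    if client_ip in state.banned_ips:
        return "block"                                  # rule 4
    if domain in state.db_white:
        return "pass"                                   # rule 5
    if domain in state.db_black:                        # rule 6
        state.db_black[domain] += 1
        if state.db_black[domain] >= state.ban_after:
            state.banned_ips.add(client_ip)
        return "block"
    page = fetch_page(referrer) if domain else None     # rule 7: the scanback
    if page is None:
        return "block"                                  # rule 8
    if site_domain in page:
        state.db_white.add(domain)
        return "pass"
    state.db_black[domain] = 1
    return "block"


# Demo with a fake fetcher; the page contents and friendly.example are made up.
SITE = "www.mysite.com"
pages = {
    "http://friendly.example/links.html": "see http://www.mysite.com/ for more",
    "http://poker.yelucie.com/": "no link to us here",
}
state = KarmaState(whitelist={"blogspot.com"}, whitewords={"/wp-admin/"})
print(check_referrer("http://viagra.blogspot.com/", "10.0.0.1", SITE, state, pages.get))
print(check_referrer("http://poker.yelucie.com/", "151.142.207.11", SITE, state, pages.get))
```

Because the whitelist and whitewords checks run before any blocking rule, an over-generous default file lets a spammer through no matter what the later rules have learned.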

Configuration

The constants: RK_DB_NAME, RK_DB_USER, RK_DB_PASSWORD, RK_DB_HOST in the file rk_settings_samples.php have to be set to a valid MySQL database. The host will normally be the localhost.

rk_settings_samples.php is renamed to rk_settings.php and all the files are uploaded to the webserver.

Ref-Karma is configured by calling the URL

http://www.myhost.com/ref-karma/referrer-karma.php?ref-karma-setup=true&pwd=password

This creates two database tables, 'ref_karma' and 'ref_karma_logs', and gives access to a simple administration interface. To call Referrer Karma before any page processing is done, edit the relevant PHP file (for example weblog.php if you are a pMachine user) and add the following lines as the first lines in the first PHP block:

include_once ("/referrer-karma.php");
check_referrer();

You will need to adjust the path to referrer-karma.php to suit your set-up.

Comments

The neat thing about referrer-karma is its ability to parse the referrer to see if your inbound-link really does occur in the page. Normally this should be the case. However there are problems with this method. Some users configure their browsers to disguise the referrer field for privacy reasons. For example there have been problems with online banks leaving session information in the referrer field which can be used to gain access to bank accounts. The referrer-karma error page does give a link to the resource which the user can click on but this is inelegant. Referrer-karma can also learn what are good and bad referrers so speeding up processing.

Parsing the referrer page, sometimes called a scanback, is expensive in terms of CPU and network resources, especially if the referrer page is large. A list of large files could form the basis of a Denial of Service attack (see below).

The whitewords.txt provides a back-door for referrer spammers. Many people will install referrer-karma with the default file and spammers can exploit this. Before installing referrer-karma check your log files to prepare a list of domains of common referrers (don't include spammers!) and put this in your own whitewords.txt file.

The system is expensive in terms of CPU resources. Tackling referrer spam should occur further forward in the HTTP processing chain.

DOS attacks with Referrer Karma

When I started using Referrer Karma on a production site my worries about its potential for use in a Denial of Service attack (DOS attack) increased. It shows that any code added to a website can increase its vulnerability. One of the first referrer pages I checked blocked, with referrer-karma furiously downloading data in the background. Imagine if I ran a simple cURL script to launch dozens of these requests: there would be very little cost to me but it would tie up a lot of resources on the webserver and potentially on the referring site.

The issue occurs when ref-karma needs to check a referring page to see if the site URL occurs somewhere. The problem is the handling of iframes and scripts. Normally these are put together at the client end. In the case of ref-karma it must check these files as well to see if the site URL is mentioned. If the script or iframe includes other documents these too must be checked recursively. At the moment ref-karma checks down to 8 levels. The problem occurred on a site with dynamic URLs in the query string.

The first change to make is to stop checking the moment the site URL is found in the referring document. It is inefficient to check all the included documents before making this test. In the majority of cases the site URL will be in the main referring document. I modified get_content_url to the function check_content_url. This checks the file block by block; the second the site URL is found it returns true.

The decision to check documents up to 4 MB is probably unwise. Few web pages are over 100 KB and most surfers would not find a URL buried more deeply. I changed the code to read only the first 400 KB of the referring document.
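The two changes (stop at the first match, stop after 400 KB) can be sketched in a few lines of Python. This is an illustration of the idea rather than the PHP in the zip file below: the function borrows the name check_content_url, but the 8 KB block size and the overlap handling are my own assumptions.

```python
import io

MAX_BYTES = 400 * 1024   # the 400 KB cap described above
BLOCK_SIZE = 8 * 1024    # hypothetical block size


def check_content_url(stream, site_url, max_bytes=MAX_BYTES, block_size=BLOCK_SIZE):
    """Scan a byte stream block by block; return True as soon as site_url is seen."""
    needle = site_url.encode("utf-8")
    tail = b""        # end of the previous block, in case the URL straddles two blocks
    read_total = 0
    while read_total < max_bytes:
        block = stream.read(min(block_size, max_bytes - read_total))
        if not block:
            return False          # end of document, URL never appeared
        read_total += len(block)
        window = tail + block
        if needle in window:
            return True           # stop reading immediately
        tail = window[-(len(needle) - 1):] if len(needle) > 1 else b""
    return False                  # gave up at the size cap


# io.BytesIO stands in for the HTTP response body of the referring page.
page = io.BytesIO(b"x" * 10000 + b'<a href="http://www.mysite.com/">link</a>')
print(check_content_url(page, "http://www.mysite.com/"))
```

Any object with a read() method will do as the stream, so the same function works on a real HTTP response. Keeping the tail of each block avoids missing a URL that happens to be split across a block boundary.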

The changes can be found in this file. I don't know if these will be rolled into the main release of ref-karma.

http://www.abcseo.com/papers/referrer-karma.zip

Update December 2005

Dr Dave has made some improvements to the vanilla release of Referrer Karma including tackling the recursion problem outlined above. Having dabbled with Bad Behavior, which was almost completely ineffective over a 3 month test period, I have moved back to Referrer Karma. Scanbacks are inelegant but when 90% of your bandwidth is being eaten by spammers it is the price to pay for a functioning web.

Overreacting to Referrer Spam

This article examines referrer spam and thinks that some webmasters are overreacting.

http://creativekarma.com/ee.php/textacts/comments/overreacting_to_referrer_spam/

Bad Behavior

Bad Behavior takes a different approach to referrer spam by examining header information. The author claims that spambots have identifiable signatures. Bad Behavior was designed and built by watching spambots going about their nefarious tasks. Bad Behavior blocks spambots with a 412 error. It also has three configurable User-Agent lists for spambots and other malicious bots which actually identify themselves. Bad Behavior can use string matching or regular expression matching against a User-Agent.

Bad Behavior also will target bots which fail to obey robots.txt. At this time some of these bots are banned by User-Agent, though in the future Bad Behavior will detect them automatically.

Bad Behavior intends to target any malicious software directed at a Web site, whether it be a spambot, ill-designed search engine bot, or system cracker. Bad Behavior integrates directly with WordPress but can work with any PHP driven site.

http://www.ioerror.us/software/bad-behavior/

Configuration

Recently one of my websites has been receiving huge amounts of referrer spam. It is a static HTML based site. At present it is only the home page that is being attacked so I changed this to a PHP extension and added the following line to the start:

<?php
require_once("./my_bad-behavior/bad-behavior-generic.php");
?>

Note that I've renamed the default install directory of Bad Behavior to confuse script kiddies. If you don't want database logging this is all you need to do. However logging is a good thing in order to see which requests are being refused. My webspace comes with MySQL as standard, so I created a new database and edited the file bad-behavior-generic.php:

You need to make sure wp_bb_date() returns the correct date format and implement the function wp_bb_db_query, which executes SQL Insert commands etc to add logs to the database table:

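function wp_bb_db_query($query) {
    // Open a persistent connection once. The "p:" host prefix is the mysqli
    // equivalent of mysql_pconnect(), which newer PHP versions no longer provide.
    static $link = null;
    if ($link === null) {
        $link = mysqli_connect("p:localhost", "username", "password", "bad-behaviour")
            or die("Could not connect to MySQL");
    }

    // Stop on a failed query; otherwise report how many rows were affected.
    $result = mysqli_query($link, $query) or die("Query failed $query");
    return mysqli_affected_rows($link);
}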

This connects to the database server using my username and password and selects the bad-behaviour database. You will need to change these values to reflect your environment. The first time Bad Behavior is used it will create the logging table. You may choose to add the line:

define('WP_BB_NO_CREATE', true);

to bad-behavior-generic.php so that Bad Behavior can skip checks to see if the database table exists. There is no log viewer interface for Bad Behavior but you can use phpMyAdmin or another utility to view the log table directly.

Comments

Bad Behavior claims to catch all sorts of automated scripts that generate comment and referrer spam. It has not been running long enough to evaluate its effectiveness. However when I analyzed my log files I didn't notice any signature to spambots - they seem to execute on zombie PCs and seem to look like versions of Internet Explorer - presumably because the spambots use the underlying Windows code to access Websites. Watch this space.

Update 19 July 2005

Okay, Bad Behavior has been running for some time now and seems to let a lot of referrer spam through. I've recently been hit by a search front end to Google that makes money from AdSense. Here is a typical log entry:

67.28.112.78 - 67.28.112.78.273701121719836309 [18/Jul/2005:21:50:36 +0100] "GET / HTTP/1.1" 200 9830 "http://www.bestfreedirectory.com/" "Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)"

Obviously legit as far as Bad Behavior is concerned. The IP address is from a Yahoo!/Geocities host.

Spam Karma

Spam Karma is a programmatic solution to comment spam from the same source as Referrer Karma.

http://www.unknowngenius.com/blog/wordpress/spam-karma

Conclusions

Unfortunately if you run a public website it is very hard to stop spam. The measures outlined can be thought of as similar to adding window locks and a burglar alarm to your house. They won't stop a determined spammer but they may just persuade him that it is better to try next door.

Further Information

http://atomicplayboy.net/blog/2005/01/30/an-introduction-to-mod-security/