====== Spam ======
The term Spam comes from a Monty Python comedy sketch set in a trucker’s café. All the dishes on the menu come with spam - a type of tinned spiced ham. In the computer world spam is used to denote excessive repetition: multiple posts, usually commercial, to forums and unsolicited email are the two most frequent examples. For SEOers the term includes the excessive use of keywords, duplicate content, unnatural link structures and the posting of links to guestbooks and membership lists.
===== Blog comment, guestbook and member list spam =====
The blog or weblog phenomenon has done a great deal to revitalize interest in the Internet following the dot-com bust. By using a pre-packaged [[content management system]] (CMS), blogs enable even technical neophytes (aka newbies) to publish their words. Blogs range from personal diaries right through to online newspapers written by professional writers and journalists who enjoy the editorial freedom the medium offers.
Blogs also have two features which attract high search engine rankings. Bloggers link freely to other sites, creating dense inter-linking between highly themed content. Bloggers are also prolific, creating large quantities of fresh content. Blogs were designed from the start to be interactive. Readers can post comments and usually include links to other sources. These features mean that the most popular blogs have [[PageRanks]] of 7.
The popularity of blogs was quickly spotted by people wishing to manipulate search engine results. They could boost the rankings of their own sites by using the comment, guestbook or member list features that are part of most blog software. Typing blog, weblog or guestbook into Google will bring up many high-ranking targets, especially when the query is combined with the inurl operator. Usually a spammer’s comment is completely irrelevant and is posted to multiple blogs as part of the same campaign:
Great article about global warming, why don’t you cool off a bit check out this page on hot babes?
Spammers even run automated scripts known as spambots. These attempt to post comment spam to sites running well known blog software. The aim is quantity rather than quality but it can mean that a single site gets hit by huge numbers of comments, often posted at the same time. Spammers are hard to trace as the spambots are frequently run on networks of compromised machines referred to as //botnets//.
Blog spam has the advantage of keyword rich [[anchor text]] coupled with highly ranked pages. The aim is not just to get click-through traffic but to subvert the ranking algorithms used by search engines. The fresh content offered by blogs means they get frequent visits from [[search engines]]. A day spent spamming the most popular blogs can rapidly boost a website to the top of the search engine results pages. As is often the case on the Web some of the most virulent spammers are pushing adult content sites and cover their tracks using anonymous proxies and compromised zombie hosts.
The popularity of this technique spread rapidly and blog spammers soon found themselves in an arms race. They have to visit the best blogs on an ever more frequent basis as other messages soon push their links off the coveted and highly ranked home page into search engine oblivion.
Needless to say blog owners are none too happy with this state of affairs. Some have removed comment pages or disabled the capability to post links. Others, wishing to preserve the spirit of the medium, spend hours moderating and removing a veritable tidal wave of spam. Technical solutions have been adopted, disguising outbound-links using JavaScript or rerouting links via a hidden page to stop anchor text and PageRank benefits from being transferred. Automated systems block links to known spammers or links using popular spam [[anchor text]] words.
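The anchor text blocking mentioned above could be sketched roughly as follows. This is a minimal illustration in Python, not the method of any particular blog package: the word list, the regular expression and the function name ''has_spam_anchor_text'' are all assumptions made for the example.

```python
import re

# Hypothetical blocklist; a real site would maintain and update its own.
SPAM_WORDS = {"viagra", "poker", "loans"}

# Captures the text between <a ...> and </a>. A regular expression is only
# an approximation of HTML parsing, which is acceptable for a sketch.
LINK_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def has_spam_anchor_text(comment_html):
    """Return True if any link in the comment uses a blocklisted word as anchor text."""
    for anchor_text in LINK_RE.findall(comment_html):
        words = set(re.findall(r"[a-z]+", anchor_text.lower()))
        if words & SPAM_WORDS:
            return True
    return False
```

A comment that trips the check could be held for moderation rather than deleted outright, since a simple word match will sometimes catch legitimate links.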
The three main search engines, [[Google]], [[Yahoo!]] and [[Microsoft]], have adopted a nofollow value for the rel attribute of the HTML anchor element:
<a href="http://www.example.com/" rel="nofollow">cheap viagra</a>
This tells search engine [[robots]] not to give any value to this link. Many CMS can automatically use this attribute value when generating comment and other links. As some comment links are genuinely useful this risks breaking a vital resource for the PageRank algorithm.
Spammers are nothing if not persistent and spam-kiddies are often unaware that their efforts now have little effect. Search Google for:
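As a rough idea of how a CMS might apply the attribute value automatically, the following Python sketch rewrites every anchor tag in a reader's comment. It is an assumption-laden illustration: the function name ''add_nofollow'' is made up, and the regular expressions only cope with simple, well-formed tags.

```python
import re

# Matches an opening <a ...> tag and captures its attribute string.
A_TAG_RE = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Matches an existing quoted rel attribute (but not, say, data-rel).
REL_RE = re.compile(r"(?<![\w-])rel\s*=\s*(\"[^\"]*\"|'[^']*')", re.IGNORECASE)


def add_nofollow(comment_html):
    """Return comment_html with rel="nofollow" present on every anchor tag."""

    def fix_tag(match):
        attrs = match.group(1)
        rel = REL_RE.search(attrs)
        if rel is None:
            # No rel attribute yet: append one.
            return "<a" + attrs + ' rel="nofollow">'
        # Keep any existing rel values and add nofollow if it is missing.
        values = rel.group(1)[1:-1].split()
        if "nofollow" not in [v.lower() for v in values]:
            values.append("nofollow")
        new_rel = 'rel="' + " ".join(values) + '"'
        return "<a" + attrs[:rel.start()] + new_rel + attrs[rel.end():] + ">"

    return A_TAG_RE.sub(fix_tag, comment_html)
```

A production system would more likely use a proper HTML parser or sanitizer; the regular expression approach is used here only to keep the example short.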
Guestbook +
and you can still find many examples like this one on a basketball site:
Name: Penis Enlargement Pills
Web Page: http://www.online-penis-enlargement-pills.com/
Gender: Male
Comments: I just wanted to say WOW! your site is really good and im proud to
be one of your perm. surfers, be sure to my penis enlargement pills project
site, dont laugh! here is my penis enlargement pills site: penis enlargement pills
Spam protection may have the effect of intensifying spam as spambots may take an ever more scattergun approach to posting. One theory on why spammers are so poor at grammar and spelling is that it helps trick automatic (Bayesian) spam filters. I suspect that after typing in 500 spam messages in a session they just get lazy.
=====Referrer spam=====
Referrer spam shows just how ingenious people can be in finding ways to manipulate search engine rankings. When someone clicks on a hyperlink their browser opens up the new web page. As part of the communications process (called HTTP, which stands for HyperText Transfer Protocol) their browser sends the web address (URL) of the page that contained the hyperlink. This address is called the //Referrer// (the HTTP header that carries it is spelled Referer). The web server hosting the new page will log this address, which is useful for traffic analysis, for example to judge the effectiveness of [[inbound-links]].
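To make the logging step concrete, here is a small Python sketch that pulls the Referrer out of access log lines and counts the most frequent ones. It assumes the log is in the widely used Apache "combined" format; the function name ''top_referrers'' and the default limit are choices made for this example only.

```python
import re
from collections import Counter

# One line of the "combined" log format (an assumption about the server setup):
# host ident user [time] "request" status size "referrer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def top_referrers(log_lines, limit=25):
    """Return the most common referrer URLs as (url, count) pairs."""
    counts = Counter()
    for line in log_lines:
        match = LOG_RE.match(line.strip())
        # "-" means the browser sent no Referrer at all.
        if match and match.group("referrer") not in ("", "-"):
            counts[match.group("referrer")] += 1
    return counts.most_common(limit)
```

Nothing in the request proves that the Referrer is genuine: the server simply records whatever value the client chose to send.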
There are a number of programs for analyzing raw log-files such as AWStats and Webalizer. These produce web pages summarizing user access on a monthly basis. Unfortunately many reports are publicly accessible and indexed by search engines and with a little knowledge about their format it is possible to locate them. Searching Google for phrases that typically occur within reports such as:
“Generated by Webalizer” or “Created by awstats”
will return thousands of Webalizer and AWStats reports. It is easy to write a script to make requests to websites with fake Referrer URLs. For example, using the popular cURL tool, whose -e option sets the Referrer, this would be:
c:\> curl.exe -e http://www.mySite.com/ http://www.targetSite.com/
Webalizer lists the top 25 Referrer URLs in its monthly statistics. Many CMS such as blogs also list top referrers on their home pages. The spammer merely has to bombard the site with enough requests to figure in this chart. This creates an inbound-link, containing [[keywords]], boosting Page or Web Rank. Some of these log pages have surprisingly high Google PageRanks.
Referrer spam has become an increasing problem. Spammers have armies of zombie hosts or botnets at their command ready to launch a campaign. These zombies are computers on the Internet where the spammer has installed a server by exploiting some security flaw in the operating system, usually Windows. Often a scattergun approach is adopted: the spammer doesn’t know whether the log file is indexed by search engines and hopes that at least a percentage of the spam will make it through. Webmasters running Apache can look at the mod_security package as a way to combat this kind of spam by blocking popular keywords in referrer URLs; examples would be poker, Viagra and loans.
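The keyword blocking idea can be illustrated independently of any particular server module. The Python sketch below is not mod_security configuration; it only shows the kind of check involved, using the three example keywords from the text. The name ''is_spam_referrer'' is hypothetical.

```python
# Example keywords taken from the text above; a real list would be longer.
BLOCKED_KEYWORDS = ("poker", "viagra", "loans")


def is_spam_referrer(referrer):
    """Return True if the referrer URL contains a blocked keyword."""
    text = (referrer or "").lower()
    return any(keyword in text for keyword in BLOCKED_KEYWORDS)
```

Requests that fail the check could be rejected or simply left out of the statistics. Plain substring matching is crude and will occasionally block a legitimate referrer, so the keyword list needs care.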
The technique is definitely frowned upon by search engines and can get you banned from their index. It manipulates search engine rankings by creating what are in effect fake inbound links. It subverts the HTTP Referrer mechanism. It clogs log files with bogus information and it consumes resources on the target web server.
Spammers may counter that it is up to server administrators to protect against this form of manipulation but that is like saying that homeowners must lock their doors or risk being robbed. There is usually no good reason to have log-files publicly viewable. The log files should be password protected and preferably not visible to the Internet. Webmasters can also use a robots.txt file to tell search engines not to index the directory containing their logs and can turn off the referrer feature in CMS. Log reports have many [[outbound-links]] on a single page so the overall benefit of each link is limited.
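As an illustration of the robots.txt approach, the following Python sketch uses the standard library's ''urllib.robotparser'' to check that a rule really does keep compliant robots away from a statistics directory. The directory name ''/stats/'' and the example URLs are assumptions; substitute wherever the log reports actually live.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all robots to stay out of /stats/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /stats/
"""


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Bear in mind that robots.txt is only a request to well-behaved robots. It does not stop anyone from viewing the reports, which is why password protection remains the better safeguard.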
===== Keyword Spam =====
Keyword spam is the excessive repetition of keywords on a page. It is usually done using HTML elements that are indexed by search engines but are not prominently visible to users, including Title, Meta and Alt text. Spammers have found that they can disguise keywords in the contents of the page by making the text the same color as the background and tucking it away at the bottom of the page. However, this still takes up space so may be noticed by competitors, particularly if they type CTRL-A to highlight all the text on a page. It is possible for search engines to detect text which is the same color as the background and this could flag that the page is using spammy techniques. Microsoft Search claims to automatically penalize such pages.
An extension of the hidden text idea is to hide the keyword spam using style-sheets (CSS). This gives the spammer great scope for stuffing keywords into important elements such as Headings without them being noticed. The following style will format all Heading 1 text as 1pt high white text.
H1 {
font-size : 1pt;
color : white;
}
There are many other ways of hiding content from users such as Layers and IFrames while still having it visible to search engines. Remember that it is possible to detect the most obvious examples of spam although forcing search engines to parse style sheets and other structures slows down indexing so few, if any, currently do this.
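To give a feel for what detecting the most obvious cases might involve, here is a deliberately naive Python sketch that scans a style sheet for rules setting the text color to the page background or using a tiny font size. It does not describe how any search engine works; the function names, the assumed white background and the size threshold are all choices made for this example.

```python
import re

# Splits a simple style sheet into (selector, declaration block) pairs.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(pt|px)")


def parse_declarations(body):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'} (lower-cased)."""
    declarations = {}
    for part in body.split(";"):
        if ":" in part:
            prop, value = part.split(":", 1)
            declarations[prop.strip().lower()] = value.strip().lower()
    return declarations


def suspicious_rules(css, page_background="white", min_font_size=4.0):
    """Return (selector, reasons) pairs for rules that look like hidden text."""
    flagged = []
    for selector, body in RULE_RE.findall(css):
        declarations = parse_declarations(body)
        reasons = []
        if declarations.get("color") == page_background.lower():
            reasons.append("text color matches background")
        size = SIZE_RE.fullmatch(declarations.get("font-size", ""))
        if size and float(size.group(1)) < min_font_size:
            reasons.append("tiny font size")
        if reasons:
            flagged.append((selector.strip(), reasons))
    return flagged
```

Run against the H1 rule shown earlier, the sketch flags it for both reasons. It is easy to defeat: writing the color as #fff or #fefefe, or hiding text with layers, would slip past it, which hints at why thorough detection is expensive.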
===== Search Engines and Spam =====
Tackling spam in results has been one of the major efforts of search engines over the last couple of years. For example in November 2006 Microsoft filed patent application 20060248072 outlining a //system and method for spam identification//. The method takes a multi-pronged approach including identifying pages that look like spam and incorporating user feedback into search results. Microsoft says that its user base of searchers is the best way of identifying whether results are spam. It suggests that something as simple as a toolbar button could be used to flag a page as spam. To prevent a spammer marking competitor pages as spam the user would be tracked via their IP address or network to identify the type and quantity of sites being marked as spam and to compare this with other user input from different queries. An obvious weakness is that a botnet could be used to generate a large amount of feedback from random IP addresses.
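The idea of tracking feedback by IP address or network can be illustrated with a short Python sketch. This is only one possible interpretation, not the method in the patent application: it counts how many distinct networks have flagged each page, so that repeated reports from one network count once. The function name, the report format and the /24 grouping of IPv4 addresses are all assumptions.

```python
import ipaddress
from collections import defaultdict


def spam_votes_by_network(reports, prefix_len=24):
    """Count the distinct networks that flagged each URL as spam.

    reports is an iterable of (ip_address, url) pairs.
    """
    networks_per_url = defaultdict(set)
    for ip, url in reports:
        # strict=False lets a host address stand in for its containing network.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        networks_per_url[url].add(network)
    return {url: len(networks) for url, networks in networks_per_url.items()}
```

Grouping by network blunts a single user filing many reports, but it does nothing against the botnet weakness noted above, where the reports really do arrive from many unrelated networks.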
Microsoft's patent also suggests that user feedback would be combined with other algorithmic techniques. For example they could examine the percentage of content that is advertising (the so-called MFA or Made for AdSense sites), whether there is keyword stuffing or if the site is part of a bad neighbourhood of spam related sites. It may also use intelligence from its content targeted advertising to identify the value of query terms, so-called //money words//. These are terms where advertisers bid high rates such as "hotel" or "viagra". Pages that satisfy these terms would have more aggressive spam filtering than non-commercial websites. Less aggressive filtering may also apply to sites that a user visits regularly and sites that they link to, so-called authority sites. This data could be gathered through the user's toolbar.
===== Reporting Spam =====
If you spot a competitor using obviously spammy techniques you can report them. Google, Yahoo! and Microsoft have web pages that let you specify the exact nature of the problem you have found.
Google
http://www.google.com/contact/spamreport.html
Yahoo!
http://add.yahoo.com/fast/help/us/ysearch/cgi_reportsearchspam
Microsoft Live
http://feedback.live.com/eform.aspx?productkey=wlsearchweb&searchtype=WebSearch&backurl=http://www.live.com/
Obviously they won’t ban sites where there is no contravention so don’t waste time reporting all your competitors.