Duplicate Content

A great deal of the Web is duplicate or near-duplicate content. Documents may be served in different formats (HTML, PDF, plain text) for different audiences. Documents may be mirrored to avoid delays or to provide fault tolerance. Content is syndicated and re-branded for different audiences and markets. Some websites aggregate or incorporate content from other sources on the Web; the most common example is RSS news feeds. Affiliate websites present identical storefronts with only cosmetic changes. Press releases are often duplicated by many media outlets. Businesses wishing to protect their trademarks often register different versions of their domain name, which all point to the same content but look like different websites from the point of view of a search engine. Content management systems, forums and blogs are often designed to let the same content be accessed through alternative URLs.

Finally, there is the problem of plagiarism and of copying from free or public domain sources, such as Wikipedia, the Open Directory Project and Project Gutenberg. This is often done to create large, content-rich sites in order to manipulate rankings and generate revenue from content-targeted advertising.

When users submit queries to search engines, they do not want the results pages stuffed with duplicate or near-duplicate pages. Indexing and filtering near-duplicate content also puts a load on search engines in terms of storage and computational resources. Algorithms already exist for efficiently identifying exact duplicates. For example, a hash function can generate a numeric fingerprint representing a page's content. Pages with identical fingerprints can be dropped from search results and skipped by robots the next time they index pages.
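As a minimal sketch of the fingerprint idea, the Python below hashes each page's text and keeps only the first page seen for each fingerprint. The choice of SHA-1, the whitespace and case normalization, the function names and the example URLs are all assumptions made for illustration; they do not describe how any particular search engine does it.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Return a hex fingerprint of the text, ignoring case and runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(pages: dict[str, str]) -> list[str]:
    """Keep the first URL seen for each fingerprint, in input order."""
    seen: set[str] = set()
    kept: list[str] = []
    for url, text in pages.items():
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            kept.append(url)
    return kept


# Hypothetical pages: the second is a mirror of the first, the third is reworded.
pages = {
    "http://example.com/a": "Press release:  Acme launches widget.",
    "http://example.org/mirror/a": "press release: acme launches widget.",
    "http://example.com/b": "Press release: Acme launches a new widget.",
}
print(drop_exact_duplicates(pages))
```

The mirror is dropped because its normalized text is identical to the first page. The third page survives even though it differs by only two words, which is exactly the weakness of whole-page fingerprints.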

Near-duplicate pages are more complicated, because changing a single word gives a page a completely different whole-page fingerprint. Both AltaVista (now owned by Yahoo!; US patents 5,970,497 and 6,138,113) and Google (6,615,209 and 6,658,423) have been awarded US patents that improve on existing methods for classifying duplicate content. The secret is to make comparisons quickly without resorting to word-by-word matching. One of AltaVista's patents looks for similarities in the outbound links on a page. Google's patents focus on generating hashes or fingerprints for parts of a page rather than the whole page. Now, to you and me neither of these ideas would seem that novel, and they probably took less than a wet Sunday afternoon in Menlo Park to conceive, but you have to remember that the US patent office also granted a patent for how to use a garden swing (US Patent No. 6,368,227). The patent land-grab is also about having some bargaining chips with other companies; many of these patents would stand up about as well as a beach condo in a Florida hurricane if tested in court. However, they do have the effect of discouraging new entrants to the market.
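To make the "fingerprints for parts of a page" idea concrete, here is a rough sketch that hashes each paragraph separately and measures how many paragraph hashes two documents share. Splitting on blank lines, using SHA-1 and dividing by the smaller document's part count are assumptions chosen for simplicity; this is not the method claimed in any of the patents above.

```python
import hashlib
import re


def part_fingerprints(text: str) -> set[str]:
    """Hash each blank-line-separated paragraph of the text separately."""
    prints: set[str] = set()
    for part in re.split(r"\n\s*\n", text):
        normalized = " ".join(part.lower().split())
        if normalized:
            prints.add(hashlib.sha1(normalized.encode("utf-8")).hexdigest())
    return prints


def shared_fraction(a: str, b: str) -> float:
    """Fraction of the smaller document's parts that also occur in the other."""
    fa, fb = part_fingerprints(a), part_fingerprints(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / min(len(fa), len(fb))


# Hypothetical documents: the copy wraps two original paragraphs in its own text.
original = "Intro paragraph.\n\nBody paragraph one.\n\nBody paragraph two."
copy = "Our own header.\n\nBody paragraph one.\n\nBody paragraph two.\n\nOur own footer."
excerpt = "Body paragraph one."

print(shared_fraction(original, copy))     # 2 of the original's 3 parts are shared
print(shared_fraction(original, excerpt))  # the excerpt is wholly contained
```

Because each comparison is a set intersection of short hashes, no word-by-word matching is needed, and a document that is only a subset of another still scores highly.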

Microsoft has also gotten into the game with a patent application (20060248066) for a “system and method for optimizing search results through equivalent results collapsing”. This application builds on a method known as shingleprints, which is the subject of an earlier patent application (20050210043). A shingleprint reduces a document to a set of features that are representative of the document, for example all the proper nouns it contains. To compare two documents, the number of features they have in common is divided by the total number of features, which gives a score between 0 and 1. Essentially similar documents will have a score closer to 1.
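The feature comparison can be sketched as follows. Two things here are assumptions rather than anything taken from the patent application: "total number of features" is read as the size of the combined feature set of both documents, and proper nouns are approximated very crudely as capitalised words that do not start a sentence.

```python
import re


def features(text: str) -> set[str]:
    """Crude stand-in for proper nouns: capitalised words not at a sentence start."""
    found: set[str] = set()
    for sentence in re.split(r"[.!?]\s+", text):
        for word in sentence.split()[1:]:
            cleaned = word.strip(".,;:!?\"'()")
            if cleaned[:1].isupper():
                found.add(cleaned)
    return found


def similarity(a: str, b: str) -> float:
    """Common features divided by all distinct features across both documents."""
    fa, fb = features(a), features(b)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


# Hypothetical documents sharing two of four distinct proper nouns.
doc_a = "The search firm Google bought YouTube. Analysts at Gartner approved."
doc_b = "Yesterday Google bought YouTube. Critics at Forrester disagreed."

print(similarity(doc_a, doc_b))  # 0.5
print(similarity(doc_a, doc_a))  # 1.0
```

A real system would need a far better feature extractor, but the scoring step stays this simple: the more features two documents share, the closer the score gets to 1.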

Both the Microsoft and Google patents can identify duplicate content that is either a subset of another document or substantially similar to it. Google suggests that the most relevant document be returned in the results pages. This could be the most recent (although to my mind most recent would imply a copy) or the document with the highest PageRank. Microsoft says that user clicks could be used to select the most popular version to return in future queries. Probably the biggest target in Google's sights at the moment is the many duplicates of freely reusable content such as Wikipedia. Some webmasters have found their original pages dropped in favor of mirrors, so the system is not without flaws. The system should also foil domain spammers who register many domain names under different keywords, all pointing to the same website. Google keeps many of the pages it considers duplicates in its secondary Supplemental Results index.
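The choice of which duplicate to show can be illustrated with a small sketch. Everything in it is hypothetical: the `Result` record, the `group` id marking pages already judged to be duplicates, the `link_score` standing in for a PageRank-like value, and the click counts. It only shows how the two selection rules described above can pick different winners.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    group: str         # hypothetical duplicate-cluster id, e.g. a shared fingerprint
    link_score: float  # hypothetical link-based score standing in for PageRank
    clicks: int        # hypothetical count of user clicks on this result


def collapse(results, key):
    """Return one Result per duplicate group: the one with the highest key value."""
    groups = defaultdict(list)
    for result in results:
        groups[result.group].append(result)
    return [max(members, key=key) for members in groups.values()]


results = [
    Result("http://original.example/essay", "g1", 0.4, 120),
    Result("http://mirror.example/essay", "g1", 0.7, 15),
    Result("http://other.example/page", "g2", 0.2, 3),
]

by_score = [r.url for r in collapse(results, key=lambda r: r.link_score)]
by_clicks = [r.url for r in collapse(results, key=lambda r: r.clicks)]
print(by_score)   # the better-linked mirror wins group g1
print(by_clicks)  # the more-clicked original wins group g1
```

With these made-up numbers, ranking by link score returns the mirror instead of the original, which is the kind of flaw webmasters have complained about.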

The implication of all this from an optimization perspective is that search engines are becoming increasingly sophisticated at identifying duplicate content. Building a site from duplicate content to inflate rankings will become increasingly difficult.

books/seo/duplicate-content.txt · Last modified: 2006/11/11 23:00 (external edit)