
Duplicate Content

A great deal of the Web is duplicate or near-duplicate content. Documents may be served in different formats (HTML, PDF, plain text) for different audiences. Documents may be mirrored to avoid delays or to provide fault tolerance. Content is syndicated and re-branded for different audiences and markets. Some websites aggregate or incorporate content from other sources on the Web. Press releases are often duplicated by many media outlets. Businesses wishing to protect their trademarks often register different versions of their name, which all point to the same content but look like different websites from the point of view of a search engine.

Finally, there is the problem of plagiarism and copying from public domain and freely licensed sources, such as Wikipedia, the Open Directory Project and Project Gutenberg. This is often done to create large, content-rich sites that manipulate search engine rankings and generate revenue from content-targeted advertising.

When users submit queries to search engines, they do not want the results pages stuffed with many duplicate or near-duplicate pages. Indexing and filtering near-duplicate content also puts a load on search engines in terms of storage and computational resources. Algorithms already exist for efficiently classifying exact duplicates. For example, a hash function can generate a numeric fingerprint representing a page's content. Pages with identical fingerprints can be dropped from search results and skipped by robots the next time they index pages.
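The fingerprinting idea above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine does it: the function names, the choice of SHA-256, and the whitespace and case normalization are all assumptions made for the example.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Return a hex digest that stands in for the page's content."""
    # Assumption: collapse whitespace and lowercase the text so that
    # trivial formatting differences do not change the fingerprint.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(pages: dict[str, str]) -> list[str]:
    """Keep the first URL seen for each distinct fingerprint."""
    seen = set()
    kept = []
    for url, text in pages.items():
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            kept.append(url)
    return kept


if __name__ == "__main__":
    # Hypothetical URLs and page text, for illustration only.
    pages = {
        "http://example.com/a": "Hello   World",
        "http://mirror.example.org/a": "hello world\n",
        "http://example.com/b": "Something else entirely",
    }
    print(drop_exact_duplicates(pages))
```

Note that a whole-page hash only catches pages that are identical after normalization. Change a single word and the fingerprint changes completely, which is why near duplicates need a different approach.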

Near-duplicate pages are more complicated. Both AltaVista (now owned by Yahoo!; US patents 5,970,497 and 6,138,113) and Google (US patents 6,615,209 and 6,658,423) have been awarded patents that improve on existing methods for classifying duplicate content. The secret is to make comparisons quickly without resorting to word-by-word matching. One of AltaVista's patents looks for similarities in the outbound links on a page. Google's patents focus on generating hashes or fingerprints for parts of a page rather than the whole page. To you and me neither of these ideas would seem that novel, and they probably took less than a wet Sunday afternoon in Menlo Park to conceive, but you have to remember that the US patent office also granted a patent for how to use a garden swing (US Patent No. 6,368,227). The patent land-grab is more about having bargaining chips with other companies; many of these patents would stand up about as well as a beach condo in a Florida hurricane if tested in court. However, they do have the effect of discouraging new entrants to the market.
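To make the "fingerprints for parts of a page" idea concrete, here is a rough sketch using a well-known general technique: hash overlapping runs of words (often called shingles) and compare the resulting sets. This is not the method claimed in any of the patents above. The function names, the shingle size of four words, and the use of Jaccard similarity are assumptions chosen for illustration. The same set comparison can be applied to the outbound links of two pages.

```python
import hashlib


def shingle_fingerprints(text: str, k: int = 4) -> set[int]:
    """Hash every run of k consecutive words into a 64-bit integer."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        shingles = [" ".join(words)]
    else:
        shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    }


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    page_a = "the quick brown fox jumps over the lazy dog"
    page_b = "the quick brown fox jumps over the lazy cat"
    print(jaccard(shingle_fingerprints(page_a), shingle_fingerprints(page_b)))

    # Hypothetical outbound link sets compared with the same measure.
    links_a = {"http://example.com/x", "http://example.com/y"}
    links_b = {"http://example.com/y", "http://example.com/z"}
    print(jaccard(links_a, links_b))
```

A score near 1.0 suggests near-duplicate pages and a score near 0.0 suggests unrelated ones. Where to draw the line is a tuning decision, and a production system would also need a way to avoid comparing every pair of pages.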

Google's patents are capable of identifying duplicate content that is a subset of another document. The inventors suggest that the most relevant document is the one returned in the results pages. This could be the most recent (although to my mind the most recent would imply a copy) or the document with the highest PageRank. Probably the biggest target in Google's sights at the moment is the many duplicates of freely reusable content such as Wikipedia. Some webmasters have found their original pages dropped in favor of mirrors, so the system is not without flaws. The system should also foil domain spammers who register many different domain names under different keywords, all pointing to the same website.
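Detecting that one document is a subset of another calls for an asymmetric measure. A minimal sketch, assuming word trigrams as the unit of comparison and a made-up numeric score standing in for PageRank or recency (none of this is taken from the patents themselves):

```python
def word_ngrams(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of runs of k consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(part: str, whole: str, k: int = 3) -> float:
    """Fraction of part's n-grams that also occur in whole."""
    a, b = word_ngrams(part, k), word_ngrams(whole, k)
    return len(a & b) / len(a) if a else 0.0


def pick_representative(scores: dict[str, float]) -> str:
    """Choose which duplicate to show: here, simply the highest score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    article = "a b c d e f g"
    excerpt = "c d e f"
    print(containment(excerpt, article))  # excerpt is fully inside article
    print(containment(article, excerpt))  # but not the other way round

    # Hypothetical scores; a real engine would use its own ranking signal.
    print(pick_representative({"original": 5.0, "mirror": 2.0}))
```

The choice of scoring signal is exactly where the flaw mentioned above creeps in: if a mirror scores higher than the original, the original is the page that gets dropped.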

See Also

Page-Jacking
