Duplicate content can result from many causes, including licensing of content to or from your site, site architecture flaws due to non-SEO-friendly CMSs, and plagiarism. Over the past five years, however, spammers in desperate need of content have begun the now much-reviled process of scraping content from legitimate sources, scrambling the words (through many complex processes), and repurposing the text on their own pages in the hope of attracting long-tail searches and serving contextual ads, among various other nefarious purposes.
Thus, today we’re faced with a world of “duplicate content issues” and “duplicate content penalties.” Here are some definitions that are useful for this discussion:
Unique content: This is content that is written by humans; is completely different from any other combination of letters, symbols, or words on the Web; and has clearly not been manipulated through computer text-processing algorithms (such as spam tools that employ Markov chains).
Snippets: These are small chunks of content, such as quotes, that are copied and reused; they are almost never problematic for search engines, especially when included in a larger document with plenty of unique content.
Shingles: Search engines look at relatively small phrase segments (e.g., five to six words), checking for the presence of the same segments on other pages on the Web. When there are too many “shingles” in common between two documents, the search engines may interpret them as duplicate ...
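To make the idea of shingles concrete, here is a minimal Python sketch of how two documents might be compared. It is an illustration only, not a description of how any search engine actually works: the function names `shingles` and `shingle_overlap`, the five-word window, and the use of Jaccard similarity (shared shingles divided by total distinct shingles) are all assumptions chosen for this example.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences ("shingles") in text.

    The default window of five words is an assumption based on the
    "five to six words" range mentioned above.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def shingle_overlap(doc_a, doc_b, size=5):
    """Return the share of shingles the two documents have in common.

    Uses Jaccard similarity (an assumed measure): 0.0 means no shared
    shingles, 1.0 means the shingle sets are identical.
    """
    a = shingles(doc_a, size)
    b = shingles(doc_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog near the river bank"
    scraped = "the quick brown fox jumps over the lazy dog by the old mill"
    print(f"overlap: {shingle_overlap(original, scraped):.2f}")
```

A score near 1.0 suggests the two pages share most of their phrasing, while a short quoted snippet inside a long, otherwise original page yields a low score. Where the line between "acceptable overlap" and "duplicate" falls is not something this sketch can answer; any cutoff you apply to the result would be your own hypothetical choice.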