This is a guest post from Duncan, an internet marketer who blogs about everything from on-site optimization to finding the best links on the net.
Duplicate content is a hot topic at the moment, with much speculation about whether it can harm your site, or whether you can actually benefit from scraping content from other sites and placing it on your own. Most webmasters, bloggers and SEO experts agree that accidental internal duplicate content, caused by pagination, categorization and so on, won’t harm your site (apart from diluting internal linking power across the duplicate pages), unless it is interpreted as manipulative duplication, which can lead to penalties.
Aside from legal ramifications, there seems to be little negative effect from content theft – taking content from other people’s sites and publishing it on your own. Indeed, many people have RSS or other feeds from external sites populating their pages, and do not report any ranking problems for the pages of their site that do have unique copy.
One thing that many people are not clear on, though, is how Google determines which is the original source of the copy and which are the duplicate versions. Here are some of the things it looks at when distinguishing original from duplicate content.
Google prides itself on being able to find new content quickly and add it to its index. It claims that its crawlers are getting quicker all the time and that it is constantly striving for more “real-time” search. It stands to reason, then, that if Google has faith in its crawling speed, it should give some weight to the version of the content it comes across first. It helps that the more authoritative sites tend to get crawled more frequently than lesser sites, which are more likely to be the ones scraping content.
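To make the "first seen" idea concrete, here is a minimal Python sketch. It is only an illustration of the heuristic described above, not a description of how Google actually works: the `first_seen` function, the crawl log structure and the example URLs are all hypothetical.

```python
from datetime import datetime


def first_seen(crawl_log):
    """Return the URL where the content was crawled earliest.

    crawl_log maps each URL to the datetime at which a (hypothetical)
    crawler first found the content on that URL.
    """
    if not crawl_log:
        return None
    return min(crawl_log, key=crawl_log.get)


# Made-up example: the same article found on two sites.
crawl_log = {
    "https://original.example/post": datetime(2010, 3, 1, 9, 0),
    "https://scraper.example/copy": datetime(2010, 3, 3, 14, 30),
}
print(first_seen(crawl_log))
```

On its own this would wrongly credit a scraper that happens to be crawled before a slow, rarely visited original, which is why it could only ever be one signal among several.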
Blog Post Dates
A slightly more speculative factor that Google might look at as part of its dupe content algorithm is post dates. Many blogs include a date above or below the content, indicating when it was published. Therefore if one version of a piece of content was marked with a date two weeks prior to another version of the content, it would suggest that the earlier version was the original.
However, some scraping sites automatically backdate their posts by a couple of months to try to trick the search engines. This can turn into a game of cat and mouse, because if Google visits one day and does not find any content, then comes back the next day and finds the content marked with a date a week earlier, chances are something is not right with the site.
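The consistency check described above can be sketched in a few lines of Python. This is a speculative illustration under simple assumptions: the function name and its two inputs (the date of the last crawl that found no such content, and the date the page claims) are hypothetical, and nothing here reflects a confirmed Google signal.

```python
from datetime import date


def date_looks_backdated(last_crawl_without_content, claimed_post_date):
    """Return True if a post claims to predate a crawl that did not find it.

    last_crawl_without_content: date of the most recent visit on which
        the content was NOT on the page (hypothetical crawler record).
    claimed_post_date: the publish date displayed on the post.
    """
    return claimed_post_date <= last_crawl_without_content


# Crawler found nothing on 10 March, then found a post "dated" 4 March.
print(date_looks_backdated(date(2010, 3, 10), date(2010, 3, 4)))

# A post dated after the empty crawl is consistent with what was seen.
print(date_looks_backdated(date(2010, 3, 10), date(2010, 3, 11)))
```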
As well as more regularly crawling powerful sites, Google also attributes more power to them when trying to distinguish the original source of copy. More respected sites rarely take content from other people, instead creating it themselves or having unique content created for them. Who would you suspect stole the money from the safe, the vicar or the local bum? Don’t answer that.
A big clue to Google about where the content originated is the links within the copy. A great deal of content is stolen by automated scraper bots that often preserve the links contained within it. This is a good argument, by the way, for always including at least one internal link in your blog posts and articles, because if Google sees a link pointing back to a site that has the same content, chances are that the content originated at the arrowhead end of the link.
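Here is a rough Python sketch of that "arrowhead" reasoning. It assumes we already know which hosts carry the same content and which links appear inside each copy; the `likely_origin` function and the example hosts are hypothetical, and the simple vote count is an illustrative stand-in rather than anything Google has confirmed.

```python
from urllib.parse import urlparse


def likely_origin(copies):
    """Guess which host the content originated on.

    copies maps each host carrying the content to the list of link URLs
    found inside its copy. A host earns one vote for every OTHER copy
    that links to it. Returns the top host, or None if nobody links
    to another copy.
    """
    votes = {host: 0 for host in copies}
    for host, links in copies.items():
        linked_hosts = {urlparse(link).netloc for link in links}
        for target in linked_hosts:
            # Ignore a site's links to itself; only cross-site links count.
            if target != host and target in votes:
                votes[target] += 1
    if not votes:
        return None
    best = max(votes, key=votes.get)
    return best if votes[best] > 0 else None


# The scraper kept the original's internal link, so it points back home.
copies = {
    "original.example": ["https://original.example/about"],
    "scraper.example": ["https://original.example/about"],
}
print(likely_origin(copies))
```

Without an internal link in the original, neither copy points at the other and the sketch returns `None`, which is the practical case for adding one.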
What other factors do you think go into distinguishing original from duplicate content?