How Google Handles Duplicate Content

This is a guest post from Duncan, an internet marketer who blogs about everything from on-site optimization to finding the best links on the net.

Duplicate content is a hot topic at the moment, with much speculation about whether it can harm your site, or whether you can actually benefit from scraping content from other sites and placing it on your own. Most webmasters, bloggers and SEO experts agree that accidental internal duplicate content, caused by pagination, categorization and the like, won’t harm your site’s power (apart from diluting internal linking power across the duplicate pages), unless it is interpreted as manipulative duplication, which can lead to penalties.
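To make that “accidental internal duplication” concrete: the same article is often reachable at several URLs that differ only in pagination, sorting or tracking parameters. Below is a toy sketch of how such variants can be collapsed to a single canonical key; the parameter list is hypothetical and this is not Google’s actual canonicalization logic.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Parameters that commonly create accidental duplicates without changing
# the main content of the page (a hypothetical, illustrative list).
IGNORABLE_PARAMS = {"page", "sort", "sessionid", "utm_source", "utm_medium"}

def normalize_url(url: str) -> str:
    """Collapse URL variants that serve the same underlying content."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in IGNORABLE_PARAMS]
    return urlunparse((parts.scheme, parts.netloc.lower(),
                       parts.path.rstrip("/") or "/", "",
                       urlencode(sorted(kept)), ""))

# Both variants normalize to the same key, so an indexer could treat
# them as one page rather than as duplicates of each other.
print(normalize_url("http://example.com/widgets?page=2&sort=price"))
print(normalize_url("http://Example.com/widgets/"))
```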

Aside from legal ramifications, there seems to be little negative effect from content theft (taking content from other people’s sites and publishing it on your own). Indeed, many people populate their pages with RSS or other feeds from external sites, and do not report any ranking problems for the pages of their sites that do carry unique copy.

One thing that many people are not clear on, though, is how Google determines which is the original source of the copy and which are the duplicate versions. Here are some of the things it looks at when telling original from duplicate content.

Finders Keepers

Google prides itself on being able to find new content quickly and add it to its index. It claims that its crawlers are getting quicker all the time, constantly striving for more “real-time” search. It stands to reason, then, that if Google has faith in its crawling speed, it should attribute some weight to the version of content it comes across first. What helps is that the more authoritative sites tend to get crawled more frequently than lesser sites, which are more likely to be the ones scraping content.
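As a toy illustration of “finders keepers”: an indexer can fingerprint each page’s text and credit the earliest URL it crawled with that fingerprint as the original. A minimal sketch, not Google’s actual algorithm:

```python
import hashlib
import time

# first_seen maps a content fingerprint to the (url, timestamp) of the
# earliest crawl that contained it -- a toy stand-in for the idea that
# the copy found first is presumed to be the original.
first_seen: dict[str, tuple[str, float]] = {}

def fingerprint(text: str) -> str:
    """Hash of lightly normalized text, so trivial whitespace changes match."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def record_crawl(url: str, text: str, crawled_at: float | None = None) -> str:
    """Return the URL currently credited as the original for this text."""
    key = fingerprint(text)
    when = crawled_at if crawled_at is not None else time.time()
    if key not in first_seen or when < first_seen[key][1]:
        first_seen[key] = (url, when)
    return first_seen[key][0]

# The scraper is crawled later, so attribution stays with the blog.
print(record_crawl("http://blog.example/post", "Original words here", 1000.0))
print(record_crawl("http://scraper.example/copy", "Original words here", 2000.0))
```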

Blog Post Dates

A slightly more speculative factor that Google might look at as part of its duplicate content algorithm is post dates. Many blogs include a date above or below the content, indicating when it was published. If one version of a piece of content is marked with a date two weeks earlier than another version, that suggests the earlier-dated version is the original.

However, some scraping sites automatically backdate their posts by a couple of months to try to trick the search engines. Doing this can set off a game of cat and mouse, though: if Google visits one day and finds no content, then comes back the next day and finds content dated a week earlier, chances are something is not right with that site.
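That check (“the date stamp predates the first crawl by far more than the usual crawl interval”) can be written down directly. A hedged sketch, assuming the crawler records when it first saw a page and how often it normally revisits the site:

```python
from datetime import datetime, timedelta

def backdating_suspicious(first_crawl: datetime, claimed_date: datetime,
                          typical_crawl_gap: timedelta) -> bool:
    """Flag a post whose claimed publish date is far older than when the
    crawler first saw it, given how often this site is normally crawled.
    """
    # Allow some crawl lag: a claimed date can reasonably precede the
    # first crawl by up to about two crawl intervals.
    slack = claimed_date + typical_crawl_gap * 2
    return slack < first_crawl

# Example: site crawled roughly daily; page first seen March 10 but
# stamped March 1 -- the stamp predates the first crawl by far more
# than the crawl interval, so something is off.
print(backdating_suspicious(datetime(2010, 3, 10), datetime(2010, 3, 1),
                            timedelta(days=1)))  # True
```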


Authority

As well as crawling powerful sites more regularly, Google also gives them more weight when trying to distinguish the original source of copy. More respected sites rarely take content from other people, instead creating it themselves or having unique content created for them. Who would you suspect stole the money from the safe, the vicar or the local bum? Don’t answer that.

Backlinks

A big clue for Google about where content originated is the backlinks within the copy. A great deal of content is stolen by automated scraper bots, which often leave the links contained within it intact. This, by the way, is a good argument for always including at least one internal link in your blogs/articles: if Google sees a link pointing back to a site that has the same content, chances are that content originated at the arrowhead end of the link.
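A toy version of that tiebreaker: given two pages carrying the same copy, check whether one of them links to the other’s domain and, if so, credit the link target. A minimal sketch using only the standard library; the example markup and domain are made up:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""
    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def links_to(html: str, domain: str) -> bool:
    """True if the document contains a link pointing at `domain`."""
    parser = LinkCollector()
    parser.feed(html)
    return any(urlparse(h).netloc.endswith(domain) for h in parser.hrefs)

# If page B carries the same copy as page A and also links back to A's
# domain, the link suggests the content originated on A.
copy_on_b = '<p>Stolen words... <a href="http://original.example/post">src</a></p>'
print(links_to(copy_on_b, "original.example"))  # True
```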

Your Thoughts

What other factors do you think go into determining the original from duplicate content?









Comments

  1. CJ says

    Thanks for the information. I have been reading that a lot of people will take a post and duplicate it to an article directory as is. I was always under the impression that duplicate content, no matter where or how it was posted, was a bad thing. I never publish the same content twice, for me or for my clients.

    I do like the idea of always adding one internal link to the original site. As Google strives to do more and more crawling in real time, I’m just betting that they will also come up with a real-time way to figure out which content is original and which is not. I just hope that people stealing content are penalized (eventually) to the point where it just isn’t worth it.

  2. Armen says

    Hi Duncan.

    I had thought about this a little. It’s cool to see some of the main factors, and they make sense. We just assume that our articles will be viewed as our articles, and these various methods help to maintain that.

    I wonder if the “blog post dates” part is lost for those who don’t show the date of the article above the post. Maybe the comment dates would help there.

    • Duncan says

      An interesting thought, Armen. Google may very well look at the comment dates also, although it is worth considering that the original source blog may not have any commenters while the scraping site may get loads very quickly.

  3. Mike says

    Good insights. The tip on including at least one or two internal links in your posts is well worth following. If someone is going to scrape your content, then you might at least pick up a couple of backlinks out of the process. Plus, internal links are a good way to potentially get visitors to explore other posts/pages of your site. In addition, Google seems to like lots of internal links.

  4. Kathleen says

    This is probably the best written article on this topic yet. You mentioned scraper bots and how it is a good idea to have a link to your site within the article for backlink purposes. I used to get angry when someone scraped my articles until I started to see some (small) backlink benefit. Unfortunately, the benefit is so small it’s hardly worth mentioning as the scraper’s site is usually sub-par.

  5. Tom says

    I’d have to agree with Kathleen. I have been trying to find information on duplicate content, but it is really hard to find quality information – which I believe this is!

    I’ve had a few MFA sites scrape a couple of my blog posts before but they usually just post a couple of sentences and then link to me so I don’t worry about them.

  6. Leo says

    Hi Duncan,

    This is actually a topic I covered in my latest newsletter. You see, I syndicate content on a regular basis, and sometimes my articles (which are originally written for my blog) get reprinted with my permission on websites that are not only trusted but considered highly authoritative by Google. A lot of times, the syndication happens within hours of my post.

    What I have found is that 9 times out of 10, the article on the syndicated site will not only outrank my article, but will cause Google to filter the original page out of the index.

    Furthermore, these sites also tend to get linked to very quickly as well as “tweeted” (I have had articles tweeted in the hundreds) making it look as though the syndicated website was the original.

    Now, I could pooh-pooh this, quit syndicating and move on. But it is all a matter of picking your poison and what you are hoping to achieve. In my case, it is visibility for some of my websites and considering that many syndication networks have a huge reach thus giving me the potential for more eyes to reach my website, it is a good deal, even if the original was filtered out.

    Just something to think about. I re-purpose content all the time because I know that the more eyes that are on my material, the greater visibility I will gain from it.

    • Duncan says

      Hi Leo,

      Thanks for the different angle on the issue. You’re absolutely right in saying that it’s not always a bad thing for syndicated content to outrank the original. I hope you manage to get a few nice backlinks from those sites, though?

      Duncan

    • Dave says

      Leo, thanks for these observations from your experience.

      I started scheduling “syndicated” articles from colleagues who are currently more dormant than not.

      My key question is whether to link back to the original article or not. Personally, I would prefer to do that, but if it would ding either the author or myself, then not.

      We’ll see. Otherwise feels like a win/win for us.

      • Leo says

        @Duncan,

        Nope…it doesn’t ding anything. In fact, I have actually created small mini-nets from duplicate content with absolutely no penalty at all and it has actually been more beneficial than not.

        Folks have duplicate content all wrong. If you really want to learn how to benefit from duplicate or syndicated content, then you have to test. This is especially true for much-debated topics such as this one, where opinions about whether it is good or not range so wildly.

        Personally, I think that most just take something and run with it based on “he said, she said” information…a lot of the time, to the blogger or marketer’s detriment.

  7. Steve says

    I hope Google doesn’t put too much weight behind dates, because as far as disingenuous bloggers go, dates are meaningless.

    Usually, I try to spend a few days each month checking my articles to see who is scraping them. I used to take it a little too personally, but from my experience, in the long run, Google does a pretty good job of sorting it out…

  8. Andrew says

    It is really comforting that Google IS doing something with regard to duplicate content. At least we can rest in the comfort of the idea that Google does know about it, is not allowing it, and is doing its best to find the original one.

  9. KS Chen says

    I think the authority of the domain is the key point in determining the original source of the content. The copycat will never be given authority by others or by the search engines. In fact, I still do not know what I can do if I find out that my blog content has been copied on another blog without my permission.

  10. Gautam says

    I think in-linking is good for two things: first for SEO purposes, and secondly for reducing the effect of content scraping. I have seen a lot of content scraping on my blog too, but usually I contact them.

    At one point, somebody linked to a blog that was scraping from me, so I contacted the one who linked to me and showed them the date of my post, and they were convinced that my content was the original.

  11. says

    There is a very nice discussion going on here about duplicate content. This is a really big issue for us, and in this regard I want to ask people for their opinion. I am working as an SEO executive, doing off-page activities for my firm.

    When it comes to submitting articles, I prefer to submit to Web 2.0 sites like Squidoo, HubPages, Vox, LiveJournal, etc., and link back to the landing page using proper anchor text. I have made accounts on the 10 best Web 2.0 sites and submit an article there whenever I get one from my content writer. So my question is: is this better than submitting articles to article directories?

  12. Marc says

    Great Post.

    I have thought about duplicate content quite a lot. I work for a few vacation rental management companies. If you have several condos, whether with management company A or B, the copy can be very similar. The events, attractions, and addresses can be as well. This can sometimes make adding properties very difficult. Both companies I work for have about 100 properties each, but I personally know companies that have over 500 units.

    I feel as though Google treats the real estate/vacation rental industry slightly differently than other sites. Very popular sites like HomeAway or VRBO (part of the same network) can have the same property (including all amenities, description, and property name), and I have never heard of any penalties for either site. Maybe it has to do with authority?

    Basically I just told you I have no idea :)

    By the way, I received a link on a very popular blogging site. Over the next 48 hours I received no fewer than 20 pingbacks from sites scraping the blog owner’s content. So something should be done; I’m just not sure what.

  13. Tim says

    The duplicate content issue is made up of a lot of hype. Yes, there can be issues; however, if your site has a good bit of authority and you set your posts to ping immediately, your content is pretty safe. Adding an internal anchor in your posts can be a good idea, but I would mix up the anchor text, as repetition can cause penalties for anchor-text over-optimisation.

    Syndicated content is a good idea, but only if the link points back to the page where the original content lives; otherwise Google may index the syndicated version and throw yours into the supplemental index, never to be seen again.

    It’s good to raise these issues as it can be a problem for newer publishers.

    Tim

  14. Dennis says

    “Most webmasters, bloggers and SEO experts agree that accidental internal duplicate content, caused by pagination, categorization and the like, won’t harm your site’s power…”

    Interesting, so many people wrong all at once.

    Google has stated quite clearly a number of times, THIS is the dupe content to watch out for MOST.

    • Duncan says

      Care to provide some references, Dennis? As far as my understanding goes, Google cares a great deal more about malicious dupe content and doesn’t expect every webmaster to know about the accidental stuff.

      • Dennis says

        I’ve been thinking about this since I wrote it. I knew someone was going to ask and I wanted to be able to snap one in here. lol

        I can’t find them. There was one actually not too long ago…either on their blog or one of their many TOS/FAQ type pages…I’ll keep looking.

        They do look at the malicious more for sure, but they also feel a webmaster should take as much care as possible to run things right. This is why they also gave examples of “accidental” duping…

        Make sure to nofollow/noindex things like tag pages, archives and all that jazz
        One category per article where possible (more than one becomes dupe)
        NOT to do full articles on home page (single post page then becomes dupe)
        etc., there was more, but that’s the gist.

        All in articles that I now can’t find.

        You’re right, newbies especially won’t know for some time. Malicious is the worst, but they do emphasize these things pretty highly.

        I hope that helps a little.

  15. James says

    I first started wondering about duplicate content in conjunction with a guy who was posting a lot of blog content in a LinkedIn group discussion board I moderate. His blog was basically a Google AdWords click-through farm, and none of the content was original.

    They were good, though, at scraping decent content and then generating backlinks… or something else was at play. In every case I checked (OK, it was only three or four checks on my part), the original content ranked lower in the SERPs and had lower PageRank than the scraper.

  16. says

    I’ve always assumed that Google was quick enough at crawling new content that it would be fairly accurate at determining the original source. It’s not something that’s ever been a particularly big issue for me, with a small website. But I suppose sticking the odd backlink into any copy I write isn’t a bad move.

    Another alternative is to put a copy of the article onto a couple of syndication websites, all with links back to the original article so that Google has a pretty clear idea of where the original comes from.

  17. Simon says

    Someone just stole one of my articles, substituting my author bio with their own links.

    It’s not fun when that happens, and there’s not much you can do about it when they just post it all over juicier blogs; it’s probably not worth the effort to try to stop them.

    The thing I hate most about it is that they create a lot of duplicate content that my own articles have to compete with; it probably devalues my articles to some degree.

    Simon

  18. LoneWolf says

    If your blog is set to ping when an article is posted, then Google should have an accurate time frame to work with in determining the original. I also have the Google XML Sitemaps plugin, which notifies Google, Bing and Ask.com when the sitemap is updated after content changes (it can notify Yahoo too if you have a Yahoo ID).
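    For reference, the sitemap ping endpoints that Google and Bing documented at the time take the sitemap URL as a single query parameter, which is what plugins like Google XML Sitemaps call on your behalf. A minimal sketch, with a hypothetical sitemap URL:

```python
from urllib.parse import quote
from urllib.request import urlopen

SITEMAP = "http://example.com/sitemap.xml"  # hypothetical sitemap URL

# Sitemap ping endpoints documented by Google and Bing at the time;
# sitemap plugins hit these automatically whenever content changes.
PING_ENDPOINTS = [
    "http://www.google.com/ping?sitemap=",
    "http://www.bing.com/ping?sitemap=",
]

for endpoint in PING_ENDPOINTS:
    with urlopen(endpoint + quote(SITEMAP, safe="")) as response:
        print(endpoint, response.status)
```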

    I guess there is a difference between duplicate content that you’ve produced by syndicating your articles and that which is blatant plagiarism or scraping. It would be nice if Google could distinguish between the two and punish the scrapers.

    On that note, how do you find out if someone is scraping your articles? Is there any software that can make this easier?
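    Services like Copyscape exist for exactly this. A crude do-it-yourself check, sketched below, is to take a distinctive sentence from your article and look for it on suspect pages (pingback sources, say); the URLs and signature text here are hypothetical:

```python
from urllib.request import urlopen, Request

# A distinctive sentence from your own article, unlikely to occur by
# chance elsewhere (hypothetical example text and URLs).
SIGNATURE = "chances are that content originated at the arrowhead end"
SUSPECTS = ["http://scraper-one.example/post", "http://scraper-two.example/feed"]

for url in SUSPECTS:
    try:
        req = Request(url, headers={"User-Agent": "scrape-check/0.1"})
        page = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")
    except OSError:
        continue  # unreachable suspect; skip it
    if SIGNATURE.lower() in page.lower():
        print("possible scrape:", url)
```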

  19. Plus Size Woman says

    Great article. However, it will become more and more difficult for the search engines and other bots to detect duplicates. The reason is, with article spinner software, one can take any interesting article and just substitute words and phrases with synonyms. The two documents, the original and the spun version, are actually the same in substance yet different in the “eyes” of search engines.
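    That is exactly the weakness of exact-match comparison. A minimal sketch of shingle overlap (one common duplicate-detection technique, not necessarily what any engine actually ships) shows why: a near-verbatim copy shares almost all of its four-word shingles with the original, while a synonym-spun version shares almost none.

```python
def shingles(text: str, w: int = 4) -> set[tuple[str, ...]]:
    """All runs of w consecutive words, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a: str, b: str) -> float:
    """Similarity of two texts as overlap of their word shingles."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

original = "the quick brown fox jumps over the lazy dog near the river"
spun = "the fast brown fox leaps over the idle dog near the stream"
copied = "the quick brown fox jumps over the lazy dog near the river bank"

print(round(jaccard(original, copied), 2))  # high: near-verbatim copy
print(round(jaccard(original, spun), 2))    # low: synonym swaps break shingles
```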

  20. Alex says

    Good effort! I think the most important factor here is website authority and rank. Sites like Technorati or CNET Reviews will obviously not copy content from ordinary blogs, so Google should be able to distinguish the original from the duplicate. Anyhow, I think Google should not index blogs with entirely copied content; it really discourages bloggers from writing original, quality content. Google should take this into consideration soon. That’s my message to Google. Thanks!

  21. Saptak Mandal says

    I don’t think blog post dates work, because as you know, popular websites are crawled more often. If a less popular website publishes something and its pages are not indexed yet, and then a popular website steals the content and gets indexed first, Google may think the less popular website stole the content and used duplicate content.

    It’s not about the post date, I think; it’s about the date when the page was indexed.

  22. says

    I think it’s quite a good idea that Google is looking after this subject. The reason is that many a time it’s the very people who created the content who get knocked out of the rankings!
    It is time Google took a deeper look into sites that make a habit of taking ‘non-unique’ content from other places and making it their own.


