Every Search Engine Spammer Needs to Know…

Last Thursday, the boys at the ‘plex announced that they would be releasing 10 gazillion keywords for statistical analysis and other research. That perked my ears up right away. We love large data sets because they are the cornerstone of building massive spam sites... I mean, targeted niche aggregators.

The fine print is that you have to jump through some hoops to get the data – details are to be released, but you will likely have to be a member of the L.B.C.


“So tell me wuts up wit dis LBC thang?”

Wait . . . make that the LDC, the Linguistic Data Consortium. Their annual membership is $20k and they sometimes make you pay more for certain data sets.

The almost invisible print is pointed out by graywolf and confirmed by Matt Cutts in this Threadwatch discussion.

When people sell a mailing list it’s extremely common for sellers to seed the list with some names that only exist for the purpose of catching people who are misusing it. I would have to assume the boys and girls at the plex would do the same. – graywolf

graywolf, you have a devious, devious mind. How many other people would consider seeding the terms with some nonsense phrases? I ask you–how many other people would come up with an idea like that?

Well, I guess I can think of a couple people.. – Matt Cutts

graywolf, yes you should take it as a compliment. Not to worry, I’m familiar with the practice. My favorite is Lye Close, the fake street in London: http://wiki.openstreetmap.org/index.php/Copyright_Easter_Eggs

billhartzer, sshhh. I was just watching boogybonbon find out about “google monitor query or googletestad” today. Don’t ruin the fun. – Matt Cutts

referring to boogybonbon’s post on keyword research.

[Image: Admiral Ackbar from Star Wars]

That’s right, it’s a trap.

We know about poisoning (ahem, “seasoning”) keyword lists – in fact sometimes we’ll do it ourselves. However, this exchange confirms what a few of us have been thinking all along – that the search engines are on to this tactic and use it as well.
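For anyone who hasn't seen seeding up close, here is a minimal sketch of the idea from the list owner's side, in Python. It is not how Google or anyone else actually does it; the function names (`make_canaries`, `seed_keyword_list`, `find_canaries`) and the random-gibberish format are made up for illustration. The owner mixes a few nonsense phrases into the list before releasing it, then checks whether a site is targeting any of them.

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def make_canaries(count, seed):
    """Generate `count` distinct nonsense terms nobody would search for."""
    rng = random.Random(seed)
    canaries = set()
    while len(canaries) < count:
        canaries.add("".join(rng.choice(LETTERS) for _ in range(12)))
    return canaries


def seed_keyword_list(keywords, canaries, seed):
    """Mix the canaries into the real keywords and shuffle the result."""
    rng = random.Random(seed)
    seeded = list(keywords) + sorted(canaries)
    rng.shuffle(seeded)
    return seeded


def find_canaries(page_keywords, canaries):
    """Return the canaries that appear among a site's targeted keywords."""
    normalized = {k.strip().lower() for k in page_keywords}
    return sorted(canaries & normalized)


# Hypothetical usage: release a seeded list, then audit a site built from it.
canaries = make_canaries(3, seed=1)
released = seed_keyword_list(["food", "cheap flights"], canaries, seed=2)
hits = find_canaries(released, canaries)
print(f"{len(hits)} canary terms found; flag for human review" if hits else "clean")
```

A site that targets even one canary almost certainly took its keywords from the seeded list, since nobody organically searches for those terms.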

Are you using wordcatcher, Overture, the Google keyword suggestor, or any data directly from the search engines? There’s a good chance that it’s a trap. If you’re using poisoned data, that could certainly explain why your sites are only lasting 6-9 weeks in the SERPs.

Understanding this kinda puts a damper on the 400+ meg file (update: mirror with data) that contains all the AOL searches of 500k users for the last 3 months.

“Jacta alea est!” – Julius Caesar

It’s a war. Develop your own supply lines so you don’t have to get food from the enemy.


15 Responses to “Every Search Engine Spammer Needs to Know…”

  1. Dudibob says:

    woah! simple, but effective

  2. rxbbx says:

    “So tell me wuts up wit dis LBC thang?” 🙂

  3. DevilKing says:

    A friend posted the other day Matt’s comment about Boogybonbon.com. But what does it have to do with all this?

  4. ka82 says:

    Wordtracker is a good supply line. Does anybody disagree?

  5. QuadsZilla says:

    DevilKing,

    Google spit out “google monitor query or googletestad” as one of the top searches. But guess what? No one is searching for that. So when you build a massive site and page 34,532 is optimized for that, it gets a red flag for human review. When they see your site is computer generated, they boot your ass.

    Get it?

  6. DevilKing says:

    hmm.. But that’s the thing that is interesting: I’m not scraping Google, and people are searching for google_monitor_query… much to think about!

    Thanks!

  7. […] After my post about my keyword research system, seoblackhat posted about another thread on another site talking about the group LDC selling a big old keyword list and provided a link to this site. […]

  8. QuadsZilla says:

    update on that AOL keyword list:

    http://battellemedia.com/archives/002792.php

    If anyone got it before they took it down, PLEASE email me.

  9. QuadsZilla says:

    hmm – yes, it looks like that is an actual query – but probably one by Google – so how could it possibly outrank food, or GOOGLE for that matter? That’s impossible.

    What Matt is saying is that Google can send high volumes of queries to make it appear as though people are searching for any term they like.

    Imagine if they weren’t the “Don’t be evil” crowd. How easy would it be for them to manipulate mid-cap stocks? Or world politics, for that matter?

  10. DevilKing says:

    I agree 100%, and I’m rather pissed that it was such a low punch by Google, and that I really did not give it much thought at the time as I was so excited about the system itself working really well.

    Not only does it show that Google is willing to stoop low to kick others in the balls, they are also willing to send mass robots to other search engines to try and seed their search results. Very dirty in my book!

  11. goldrush says:

    What are some good sources of natural keyword data then? Your own log files are one place to look, but it’s good to have a headstart when you start building for new niches.

  12. jerome says:

    ok, i give up. How The F*** do you open a 40mb txt file without crashing ur pc???

  13. […] Interesting take on keyword list freebies . Are you using wordcatcher, overture, the google keyword suggestor or any data directly from the search engines? It seems there’s a good chance that it could be a trap. If you’re using poisoned data, that could certainly explain why your sites are only lasting 6-9 weeks in the SERPs. […]

  14. […] 1. Popular sites with nice user interfaces that are basically glorified scraper sites 2. The New York Times cloaking; in fact, of the top 500 alexa sites, less than 10% show the same content all over the world 3. Major sites creating computer generated content 4. Google Poisoning Keyword lists (by referral spamming) 5. Massive site Networks purchased and built primarily to link farm […]

  15. […] Turns out that Google’s been using them for a while. (Hat tip Quadszilla) I daresay that they’re not so effective at using it yet, or else they don’t care that much when they know that all the players in an industry have their hands dirty. But as in chess, you can’t count on your opponent (here, Google), not making the best move. And integrating N grams is really a great move to make. […]