Archive for the ‘Keyword Stuffing’ Category

Phrase Based Indexing and Retrieval Spam Detection

Via the Mad Hat, here’s the interesting part from a PaIR system article:

The process takes place both at indexing and retrieval. In essence, the document gets its spam score at indexation; then, upon retrieval, should that page be included in the results, weighting is removed and the page is devalued during the ranking process based on the previously calculated spam threshold score.

According to the folks that drafted it, a normal related, topical phrase occurrence (or related phrases) is on the order of 8-20, whereas the typical Spam document would contain between 100-1000 related phrases. So by looking for statistical deviations in related phrase occurrences, the system can flag an item as Spam. Once again, it is mostly for the high end, but an unusually low count can also be used as a flag (which could be compared to the link profile for link spam).

Two things to digest there.

1. The system applies a spam score at both indexing and retrieval, and
2. Standard Deviation on both the high end and low end could count as a spam flag.
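
To make the deviation flagging concrete, here’s a rough Python sketch of the idea. The related-phrase cluster and the exact cutoffs are illustrative assumptions on my part; the only numbers from the article are the 8-20 “normal” range and the 100-1000 spam range.

# Rough sketch of the phrase-deviation flagging described above.
# The phrase cluster and cutoffs are assumptions for illustration only.
def count_related_phrases(text, related_phrases):
    """Count occurrences of phrases from the topic's related-phrase cluster."""
    lowered = text.lower()
    return sum(lowered.count(phrase.lower()) for phrase in related_phrases)

def spam_flag(related_count, low=8, high=20, spam_floor=100):
    """Assign a rough spam label at index time based on the related-phrase count."""
    if related_count >= spam_floor:
        return "spam"            # far above the normal 8-20 band
    if related_count < low:
        return "low-deviation"   # suspiciously few; compare with the link profile
    if related_count <= high:
        return "normal"
    return "elevated"            # above normal, below the spam floor

# Hypothetical example:
cluster = ["bookmarklet", "browser button", "javascript link", "toolbar"]
doc = "How to add bookmarklet buttons to your blog with a javascript link..."
print(spam_flag(count_related_phrases(doc, cluster)))

At retrieval, the stored score would simply be read back and used to devalue the page during ranking, per the quote above.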

Of course, the only reason spammy docs sometimes have up to 100 times the related phrase density of a non-spammy page is that this behavior continues to be rewarded in the SERPs. Even if the spam flag is raised and the site is eventually banned, classic keyword and related phrase stuffing continues to rank in the SERPs.

Amazing Real Search Queries

We got our hands on a source for real searches from “a major search engine”. These are not referral logs from any site, but actual searches typed into the search engine. It’s similar to the AOL database in that sense. While loading up 5,824,601 of the real search phrases for our forum members to use, I took a look at some of these crazy search queries. Here are some of the highlights:

how do i stop taking phentermine
stop taking lexapro
bus stop whore
look up young girls skirts
how to kill yourself in 60 secounds
how to kill a sister
how to kill dogs with antifreeze
how to castrate your husban
my husband doesn’t care about our marriage and my kids won’t talk to him
kurt cobain is a god
33% of women say they’ve had this happen to them at work. 80% would like to have it happen
how can you help save the rain forest
where does bill gates live

There are tons more amazing searches in there. Unlike the AOL searches, our db does not link the searches to the users. Seeing things like this only reinforces the need for Internet privacy. Do you want slimy politicians getting their hands on information like this for blackmail, campaigns, or political purposes? I know I don’t. How long before some “high morality” group thinks it’s a good idea to legislate against Thoughtcrimes?

Anyway, enough with the politics and on to making money. Forum members can now download the entire list of 5,824,601 real search phrases from this thread.

Enjoy!

Every Search Engine Spammer Needs to Know…

Last Thursday, the boys at the ‘plex announced that they would be releasing 10 gazillion keywords for statistical analysis and other research. That perked my ears up right away. We love large data sets because they are the cornerstone of building massive spam sites (er, targeted niche aggregators).

The fine print is that you have to jump through some hoops to get the data – details are to be released, but you will likely have to be a member of the L.B.C.


“So tell me wuts up wit dis LBC thang?”

Wait . . . make that the LDC, the Linguistic Data Consortium. Their annual membership is $20k and they sometimes make you pay more for certain data sets.

The almost invisible print is pointed out by greywolf and confirmed by Matt Cutts in this threadwatch discussion.

When people sell a mailing list it’s extremely common for sellers to seed the list with some names that only exist for the purpose of catching people who are misusing it. I would have to assume the boys and girls at the plex would do the same. – Greywolf

graywolf, you have a devious, devious mind. How many other people would consider seeding the terms with some nonsense phrases? I ask you–how many other people would come up with an idea like that?

Well, I guess I can think of a couple people.. – Matt Cutts

graywolf, yes you should take it as a compliment. Not to worry, I’m familiar with the practice. My favorite is Lye Close, the fake street in London: http://wiki.openstreetmap.org/index.php/Copyright_Easter_Eggs

billhartzer, sshhh. I was just watching boogybonbon find out about “google monitor query or googletestad” today. Don’t ruin the fun. – Matt Cutts

referring to boogybonbon’s post on keyword research.

(Image: Admiral Ackbar from Star Wars – “It’s a trap!”)

That’s right, it’s a trap.

We know about poisoning (er, seasoning) keyword lists – in fact, sometimes we’ll do it ourselves. However, this exchange confirms what a few of us have been thinking all along – that the search engines are on to this tactic and use it as well.
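
For anyone who hasn’t done it, seasoning a list is trivial. Here’s a minimal Python sketch of the idea; the canary generator, keyword examples, and everything else here are hypothetical, not anyone’s actual method.

# Minimal sketch of seeding ("seasoning") a keyword list with canary phrases.
import random
import string

def make_canary():
    """Generate a nonsense phrase nobody would ever search for organically."""
    word = "".join(random.choices(string.ascii_lowercase, k=8))
    return f"{word} widget review"

def season_list(keywords, n_canaries=5):
    """Mix canaries into the list; keep them so misuse can be spotted later."""
    canaries = {make_canary() for _ in range(n_canaries)}
    seasoned = keywords + list(canaries)
    random.shuffle(seasoned)
    return seasoned, canaries

keywords = ["bacon polenta recipe", "bookmarklet buttons", "free cloaking script"]
seasoned, canaries = season_list(keywords)

If one of your canary phrases later shows up on someone else’s pages or in a query log, you know exactly where the list came from – which is presumably what the ‘plex would do with its data set.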

Are you using wordcatcher, overture, the google keyword suggestor or any data directly from the search engines? It seems there’s a good chance that it could be a trap. If you’re using poisoned data, that could certainly explain why your sites are only lasting 6-9 weeks in the SERPs.

Understanding this kinda puts a damper on the 400+ meg file (update: mirror with data) that contains all the AOL searches of 500k users for the last 3 months.

“Jacta alea est!” – Julius Caesar

It’s a war. Develop your own supply lines so you don’t have to get food from the enemy.

Google: “Keyword Density Matters More Than Links”

Check out this screenshot of a search I did yesterday:

(Screenshot: Google supplemental results)

The search string: bookmarklets seoblackhat.

1st Result in Google for seoblackhat bookmarklets:

seoblackhat.com/2006/03/06/how-to-add-sexy-bookmarklet-buttons-to-your-blog/feed

The size of the web page is 5340 bytes.

No Title or meta tags.

But uses XML title: Comments on: How to Add Sexy Bookmarklet Buttons to Your Blog

Keywords found on page:
bookmarklet – 18 – 2.50%
seoblackhat – 11 – 1.53%

Google sitemap priority: 0.5

Total links to URL: 0

In Google Supplemental Results: Yes

vs.

2nd result in Google for seoblackhat bookmarklets:

seoblackhat.com/2006/03/06/how-to-add-sexy-bookmarklet-buttons-to-your-blog/

Title: How to Add Sexy Bookmarklet Buttons to Your Blog SEO Black Hat: SEO Blog
Description: SEO Black Hat : A Great Tutorial on How to Add Sexy Bookmarklet Buttons to your WordPress Blog
Keywords: SEO Black Hat , Black Hat, Black Hat SEO, Search Engine Optimization,
Robots: All,Index,Follow

The size of the web page is 20111 bytes.

Keywords found on page:
bookmarklets – 6 – 0.90%
seoblackhat – 3 – 0.45%

Keywords found in the Anchor tags:
bookmarklet – 17
seoblackhat – 2

Keywords found in the IMG Alt tags:
bookmarklet – 17
seoblackhat – 0

Google sitemap priority: 0.5

Total Links to URL: 87

In Google Supplemental Results: No
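
For reference, the keyword density figures above are just occurrences divided by total word count. Here’s a rough Python sketch of that calculation; the file name is a placeholder for a saved copy of the page, and real density tools differ in how they strip markup and tokenize.

# Rough sketch of a keyword density calculation (occurrences / total words).
import re

def keyword_density(html, keyword):
    text = re.sub(r"<[^>]+>", " ", html)           # crude tag strip; real tools do more
    words = re.findall(r"[a-z0-9]+", text.lower())
    count = sum(1 for w in words if w == keyword.lower())
    return count, (count / len(words) * 100) if words else 0.0

page = open("feed.html").read()                    # hypothetical saved copy of the page
for kw in ("bookmarklet", "seoblackhat"):
    count, pct = keyword_density(page, kw)
    print(f"{kw} – {count} – {pct:.2f}%")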

There are several surprising things about these results.

1. A supplemental page can rank above non-supplemental results.

2. An RSS 2.0 page can outrank a similar page in html.

3. A page with 0 inbound links can outrank a page on the same domain, on the same topic, that has 87 inbound links.

Conclusions: Google obviously cares about links. However, Google seems to be giving the link trust to the domain rather than to the individual page. Then, on a given domain, Google determines the relevance of a page based on keyword density, even if another page on that topic has more inbound links. Keyword density matters. Domain trust is so important that supplemental results can outrank non-supplemental results from a less trusted domain.

This actually isn’t such a bad idea. However, one of the biggest flaws with the current implementation is that an RSS 2.0 page can rank above an html page. Google should change this. Unless a user specifically says they are searching for RSS / XML (or PDF, for that matter) formatted pages, html pages should be given much more weight. The last thing anyone wants (searcher, webmaster, or Google) is for a user to query and land on a page that is not formatted for their viewing pleasure.

If keyword density is so important to getting the user to the right page on my domain, shouldn’t I be cloaking? As long as I’m not misleading the user, shouldn’t Google change their absurd public stance against cloaking so webmasters can help with indexing? I’m a target, so I really can’t cloak this domain. However, if your domain is more like nytimes.com than seoblackhat.com, you really should be cloaking.

Free Cloaking Script

You’re broke as a joke but want to cloak, so what can you do? How about a free cloaking script?

Let’s say you’ve used widgetbaiting or a Markov chain generator to create 30,000 pages of unique content about bacon polenta recipes. Of course, no human surfer wants to read those pages, but they are great spider food.
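
If the Markov chain part sounds exotic, it isn’t. Here’s a bare-bones Python sketch of that kind of generator; the seed corpus file name is hypothetical, and real page generators layer templates, titles, and interlinking on top of this.

# Bare-bones Markov chain text generator for churning out "unique" spider food.
import random
from collections import defaultdict

def build_chain(text, order=2):
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=200, order=2):
    out = list(random.choice(list(chain)))          # start from a random state
    while len(out) < length:
        options = chain.get(tuple(out[-order:]))
        if not options:                             # dead end: jump to a new random state
            out.extend(random.choice(list(chain)))
            continue
        out.append(random.choice(options))
    return " ".join(out)

seed_text = open("bacon_polenta_articles.txt").read()   # hypothetical niche corpus
page_body = generate(build_chain(seed_text))            # gibberish, but technically unique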

Well if you don’t want to use IP delivery like you’re supposed to, you can use this code to send your surfers to a sell page with text written for human consumption.

Now, this is not some unsneaky JavaScript redirect that will get you banned in the search engines. (*If you use this code, you may get banned in some search engines.*) Rather, it’s an error loophole designed for you to exploit:

<img src="nofilehere.gif" onerror="window.open('http://seoblackhat.com','_top')">

Just make a page with any kind of spider food / keyword spam that you want on it and then add that line to the page.

When surfers visit the page, they will be sent to “seoblackhat.com” because the requested image file does not exist (therefore there will be an error). The spiders and search engines, on the other hand, will all see the original page.

This free cloaking script is inferior to premium cloaking software for many reasons. If you are scraping content, this method does nothing to help you get past duplicate content filters. This free cloaking code does not protect your code from surfers or your competition. Surfers will briefly see these spider food pages load. They may, in turn, report you to the search engines, which could decide that using this code in the manner described is abusive. So, I would not recommend it for sites that you cannot afford to have banned.

Many high profile sites and Fortune 500 companies use cloaking to send different content to different IP addresses. But they don’t use code like this or cheesy redirect scripts – they use sophisticated cloaking software; IP delivery is the safer and preferred way to cloak. Honestly, I’ve never even heard of someone actually getting banned just for IP cloaking. I know that people do get banned for using crappy JavaScript redirects, but in my opinion, getting banned for IP Cloaking is one of the great Black Hat SEO myths; it just doesn’t happen.
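
To show what IP delivery looks like in its simplest form, here’s a hedged sketch using Flask. The route, file names, and crawler IP prefix are placeholders; real cloaking software maintains large, verified lists of crawler IPs and user agents, which is most of what you’re paying for.

# Minimal sketch of IP delivery: crawlers get the spider food, humans get the sell page.
from flask import Flask, request, send_file

app = Flask(__name__)

# Placeholder prefix for illustration; real setups verify crawler IPs properly.
SPIDER_PREFIXES = ("66.249.",)

@app.route("/recipe/<slug>")
def recipe(slug):
    ip = request.remote_addr or ""
    if ip.startswith(SPIDER_PREFIXES):
        return send_file(f"spider_food/{slug}.html")   # keyword-rich page for crawlers
    return send_file("sell_page.html")                 # human-readable page for visitors

if __name__ == "__main__":
    app.run()

Both the crawler and the human get a normal page at the same URL, with no client-side redirect for anyone to trip over.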

Yahoo and Syndic8 – Two More Filthy Black Hat Spammers

Here are two more examples of Search Engine “Spam” by big name companies. The first is classic SEO Black Hat spam. Syndic8.com computer-generated almost 200,000 pages at subdomains of their main site, stuffed them with keywords, and stuck AdSense on the pages. For details, check out the Syndic8 gets Wiped by Google – WordPress Style threadwatch discussion.

<applauding>Fantastic work, boys!</applauding> (Except for the getting caught part.) However, when they were caught and banned by the search engines, they just took down the black hat spam pages and their site was unbanned – just like what happened when WordPress was caught.

The lesson is: if you are big enough, you can get away with anything when it comes to Search Engine Spamming. What are the search engines gonna do? Not list WordPress and Syndic8? Yeah, right! It’s like having diplomatic immunity – the normal rules and laws just don’t apply.

So why do these reputable companies engage in SEO Black hat strategies?

The answer is obvious. We’ve all seen the business model:

Step 1. Make a Website
Step 2. ???
Step 3. Profit.

In case you were wondering, Step 2 is Implement SEO Black Hat Strategies and Tactics.

Next, I would like to take a moment to welcome Yahoo to the SEO Black Hat Fraternity. We know their representatives specifically call hidden text / color-on-color text in a page spam.

Highlight between the “different code on the next screen” text and the close window button on this yahoo link and you’ll find white-on-white text that reads:

Visually impaired or blind users: We can help you register. So that a customer care representative can contact you, please provide your phone number in addition to your required email address when you contact us by pasting this URL into your browser:
http://add.yahoo.com/fast/help/us/edit/cgi_access

If you’re thinking “that’s not spam – that’s for the text readers the blind use, you jackass!” then could I write, in black-on-black text:

“Visually impaired or blind users: This site is about Keyword, Keyword (verb), keyword (noun) . . . etc.”?

Hmmm . . . I might try that!

Now, while I don’t call this Spam (for me, SPAM is SERPs Positioned Above Mine), it does fall into Yahoo’s definition. So Yahoo only gets a technical admission to our club.

Yahoo – consider yourself on probation. While, technically, you are an SEO Black Hat, you’re going to have to step it up a notch to keep your SEO Black Hat membership card.

Google and WordPress are Search Engine Spammers

So what kind of evil scum participate in SEO Black Hat Search Engine Spamming, Cloaking and Keyword Stuffing?

Well, Google and WordPress, to start. These two articles:

Google Caught Cloaking and Keyword Stuffing?
and
WordPress Website’s Search Engine Spam

illustrate how everyone is doing it – and sometimes even the big boys get caught. Two quotes that aptly apply to SEO Black Hat:

“If you ain’t cheating, you ain’t trying” and “if you ain’t got caught, you ain’t cheating.”

So if you are going to use SEO Black Hat tactics like Keyword Stuffing (Search Engine Spamming, Black Hat SEO, Cloaking, IP Delivery), be smart about it so you’re not easily caught. Remember:

“The greatest trick the devil ever pulled was convincing the world he didn’t exist.” – Verbal Kint