Phrase Based Indexing and Retrieval Spam Detection

Via the Mad Hat, here’s the interesting part from a PaIR system article:

The process takes place both at indexing and retrieval. In essence, the document gets its spam score at indexing; then, upon retrieval, should that page be included in the results, weighting is removed and the page is devalued during the ranking process according to the previously calculated Spam threshold scoring/weighting.

According to the folks that drafted it, a normal count of related, topical phrases is on the order of 8-20, whereas the typical Spam document would contain between 100 and 1000 related phrases. So by looking for statistical deviations in related phrase occurrences, the system can flag an item as Spam. Once again it is mostly for the high end, but a low deviation count can also be used as a flag for low occurrences (which could be compared to the link profile for link spam).

Two things to digest there.

1. The spam score is calculated at indexing and then applied at retrieval, when the page is devalued in the rankings, and
2. Statistical deviation at either the high end or the low end could count as a spam flag.
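
To make those two points concrete, here is a rough Python sketch. It is not the system described in the article, only an illustration of the idea. The 8-20 and 100+ figures come from the quote above; everything else (the IndexedDoc class, the flag names, the penalty multipliers) is made up for the example.

```python
from dataclasses import dataclass

# Lower bound of the "normal" 8-20 range mentioned in the quote.
NORMAL_LOW = 8
# The quote puts typical spam at 100-1000 related phrases.
SPAM_HIGH = 100

# Hypothetical multipliers; the article gives no actual weights.
PENALTY = {"none": 1.0, "low": 0.8, "high": 0.1}


@dataclass
class IndexedDoc:
    url: str
    relevance: float         # hypothetical base ranking score
    related_phrases: int     # related phrase count found at indexing
    spam_flag: str = "none"  # stored with the document at indexing


def flag_at_indexing(doc: IndexedDoc) -> IndexedDoc:
    """Point 1, first half: compute and store the flag at indexing."""
    if doc.related_phrases >= SPAM_HIGH:
        doc.spam_flag = "high"   # far above the normal range
    elif doc.related_phrases < NORMAL_LOW:
        doc.spam_flag = "low"    # point 2: the low end can be a flag too
    else:
        doc.spam_flag = "none"   # counts of 21-99 are left alone here
    return doc


def rank_at_retrieval(docs):
    """Point 1, second half: devalue flagged pages during ranking."""
    scored = [(d.relevance * PENALTY[d.spam_flag], d.url) for d in docs]
    return [url for _, url in sorted(scored, reverse=True)]


docs = [
    flag_at_indexing(IndexedDoc("stuffed.example", 0.9, 400)),
    flag_at_indexing(IndexedDoc("normal.example", 0.6, 12)),
    flag_at_indexing(IndexedDoc("thin.example", 0.7, 2)),
]
ranking = rank_at_retrieval(docs)
print(ranking)
```

In this toy run the stuffed page starts with the highest base score and still ends up last, because the flag stored at indexing is only cashed in when the results are ranked.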

Of course, the only reason spammy docs sometimes have up to 100 times the related phrase density of a non-spammy page is that this behavior continues to be rewarded in the SERPs. Even if the spam flag is raised and the site is eventually banned, classic keyword and related phrase stuffing continues to rank in the SERPs.


3 Responses to “Phrase Based Indexing and Retrieval Spam Detection”

  1. thegypsy says:

    Hey dOOd – glad to see some more interest (Dave from Reliable here). Recently Bill (slawski) did some work on it and there is even a couple threads on it now at WMW (yea yea fishing weenie land). I have to think tho, as mentioned more than a few times, that it is merely in addition to current Spam filters ( you know – watching fer link spam). So it’s more of a double barrel approach (funnel).

    If you have some tin-foil, make a hat and check out this;

    A little conspiracy theory (on my Blog) involving PaIR and GoogleBOmbs.


  2. [...] So that wraps up today’s post on supplemental results. Hold on, wait a minute, there’s a strange force pulling me to link to this post from Quadszilla… Phrase Based Indexing and Retrieval Spam Detection. I don’t understand it fully but it gets me all black hat tingly. [...]

  3. [...] So, if you want to test how well a Black hat technique like hiding text on a webpage can be, then you should buy a throwaway domain (one you don’t care about), and use it as a testing ground. Make sure the Whois info is different from your primary domains. [...]