Phrase Based Indexing and Retrieval Spam Detection

Via the Mad Hat, here’s the interesting part from a PaIR system article:

The process takes place both at indexing and retrieval. In essence the document gets its spam score at indexation and then upon retrieval, should that page be included in the results, weighting is then removed and the page is devalued during the ranking process for previously calculated Spam threshold scoring/weighting.

According to the folks that drafted it, a normal count of related, topical phrases is on the order of 8-20, whereas the typical Spam document would contain between 100 and 1000 related phrases. So by looking for statistical deviations in related phrase occurrences the system can flag an item as Spam. Once again it is mostly for the high end, but a low deviation count can also be used as a flag for low occurrences (which could be compared to the link profile for link spam).

Two things to digest there.

1. The system calculates a spam score at indexing and applies it at retrieval, and
2. Statistical deviation at either the high end or the low end could count as a spam flag.
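
To make those two points concrete, here is a rough Python sketch of how flagging at indexing and devaluing at retrieval could fit together. It is only an illustration, not the method from the PaIR article: the cut-offs reuse the 8-20 and 100-1000 figures quoted above, while the names (IndexedDoc, flag_at_indexing, rank_at_retrieval) and the penalty multipliers are made up for the example.

```python
from dataclasses import dataclass

# Assumed cut-offs based on the quoted figures: 8-20 related phrases is
# described as normal, 100 or more as typical of spam. The quote says
# nothing about counts between 21 and 99, so this sketch leaves them unflagged.
NORMAL_LOW = 8
SPAM_HIGH = 100

# Hypothetical penalty multipliers, invented for illustration only.
PENALTY = {"none": 1.0, "low": 0.5, "high": 0.25}


@dataclass
class IndexedDoc:
    url: str
    related_phrase_count: int
    spam_flag: str  # "none", "low", or "high"


def flag_at_indexing(url: str, related_phrase_count: int) -> IndexedDoc:
    """Step 1: store a spam flag with the document when it is indexed."""
    if related_phrase_count >= SPAM_HIGH:
        flag = "high"
    elif related_phrase_count < NORMAL_LOW:
        flag = "low"
    else:
        flag = "none"
    return IndexedDoc(url, related_phrase_count, flag)


def rank_at_retrieval(candidates):
    """Step 2: devalue flagged documents while ranking the result set.

    `candidates` is a list of (IndexedDoc, relevance) pairs.
    Returns (url, adjusted_score) pairs, best first.
    """
    scored = [
        (doc.url, relevance * PENALTY[doc.spam_flag])
        for doc, relevance in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With these made-up numbers, a stuffed page with 400 related phrases and a raw relevance of 0.9 would end up below an ordinary page with 12 related phrases and a raw relevance of 0.5, because the flag stored at indexing cuts its score at retrieval.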

Of course, the only reason spammy docs sometimes have up to 100 times the related phrase density of a non-spammy page is that this behavior continues to be rewarded in the SERPs. Even if the spam flag is raised and the site is eventually banned, classic keyword and related-phrase stuffing continues to rank in the SERPs.

  1. thegypsy says:

    Hey dOOd – glad to see some more interest (Dave from Reliable here). Recently Bill (slawski) did some work on it and there is even a couple threads on it now at WMW (yea yea fishing weenie land). I have to think tho, as mentioned more than a few times, that it is merely in addition to current Spam filters ( you know – watching fer link spam). So it’s more of a double barrel approach (funnel).

    If you have some tin-foil, make a hat and check out this;

    A little conspiracy theory (on my Blog) involving PaIR and GoogleBOmbs.


