Splogs (or Snews) Using Web-Stemming

Most readers are familiar with the current trend of splogs (spam blogs) stuffed with three AdSense units above the fold of the page.

These pages rely on the blackhat SEO technique of scraping RSS feeds from multiple blogs. The scraped pages then show up in Google’s results (if the SEO-er is innovative enough).

If you use partial feeds, you’re safe, right?

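Here’s Webstemmer, a Python tool that takes it all to a new level: it pulls the main text straight out of a site’s HTML pages, so it doesn’t need a feed at all.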

“Snews” — scraping the news sites

Here’s their claimed accuracy:

New York Times 488.8/552.2 (88%)
Newsday 373.7/454.7 (82%)
Washington Post 342.6/367.3 (93%)
Boston Globe 332.9/354.9 (93%)
ABC News 299.7/344.4 (87%)
BBC 283.3/337.4 (84%)
Los Angeles Times 263.2/345.5 (76%)
Reuters 188.2/206.9 (91%)
CBS News 171.8/190.1 (90%)
Seattle Times 164.4/185.4 (89%)
NY Daily News 144.3/147.4 (98%)
International Herald Tribune 125.5/126.5 (99%)
Channel News Asia 119.5/126.2 (94%)
CNN 65.3/73.9 (89%)
Voice of America 58.3/62.6 (94%)
Independent 58.1/58.5 (99%)
Financial Times 55.7/56.6 (98%)
USA Today 44.5/46.7 (96%)
NY1 35.7/37.1 (95%)
1010 Wins 14.3/16.1 (88%)
Total 3829.1/4349.2 (88%)

It’s fairly accurate, with an 88% average while scraping professional news sources. If you read a lot of news online, you’ll know how much the text gets split up, with news items broken up by random ads. Now, how much easier would it be to scrape WordPress blogs that EACH have the SAME EXACT template structure? It wouldn’t take much.
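
To make the shared-template point concrete, here is a minimal sketch in Python. It is not Webstemmer’s actual algorithm, just a much cruder illustration of the same idea: if every page uses the same template, the lines that repeat across pages are template chrome and whatever is left is the post. The function name `strip_shared_lines`, the `threshold` parameter, and the sample pages are all made up for this example.

```python
from collections import Counter

def strip_shared_lines(pages, threshold=0.8):
    """Drop lines that recur on most pages; keep the rest as likely content.

    Hypothetical helper for illustration only, not part of Webstemmer.
    """
    # Count on how many pages each distinct non-empty line appears.
    seen = Counter()
    for page in pages:
        seen.update({line.strip() for line in page.splitlines() if line.strip()})

    # A line is treated as template text if it shows up on at least
    # `threshold` of all pages.
    cutoff = threshold * len(pages)
    cleaned = []
    for page in pages:
        kept = [line.strip() for line in page.splitlines()
                if line.strip() and seen[line.strip()] < cutoff]
        cleaned.append(kept)
    return cleaned

# Three made-up pages that share one template (header, menu, footer).
pages = [
    "My Blog\nHome | About\nPost one title\nText of the first post.\nPowered by WordPress",
    "My Blog\nHome | About\nPost two title\nText of the second post.\nPowered by WordPress",
    "My Blog\nHome | About\nPost three title\nText of the third post.\nPowered by WordPress",
]

cleaned = strip_shared_lines(pages)
for lines in cleaned:
    print(lines)
```

Real pages are messier than this (ads change between requests, and markup matters as much as text), which is why a layout-based tool does better than a line count.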

Below is how the extracted text is broken up (the notes in parentheses on the right are explanations, not part of the output):

$ cat cnn.txt

!UNMATCHED: 200511210103/www.cnn.com/                                             (unmatched page)

!UNMATCHED: 200511210103/www.cnn.com/privacy.html                                 (unmatched page)

!UNMATCHED: 200511210103/www.cnn.com/interactive_legal.html                       (unmatched page)

...

!MATCHED: 200603010455/www.cnn.com/2006/HEALTH/02/09/billy.interview/index.html   (matched page)

PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html      (layout pattern name)

SUB-0: CNN.com - Too busy to cook? Not so fast - Feb 9, 2006                      (supplementary section)

TITLE: Too busy to cook? Not so fast                                              (article title)

SUB-10: Leading chef shares his secrets for speedy, healthy cooking               (supplementary section)

SUB-17: Corporate Governance                                                      (supplementary section)

SUB-17: Lifestyle (House and Home)

SUB-17: New You Resolution

SUB-17: Billy Strynkowski

MAIN-20: (CNN) -- A busy life can put the squeeze on healthy eating. But that     (main text)

         doesn't have to be the case, according to Billy Strynkowski, executive

         chef of Cooking Light magazine. He says cooking healthy, tasty meals

         at home can be done in 20 minutes or less.

MAIN-20: CNN's Jason White interviewed Chef Billy to learn his secrets for

         healthy cooking on the run.

...

SUB-25: Health care difficulties in the Big Easy                                  (supplementary section)

!MATCHED: 200603010455/www.cnn.com/2006/EDUCATION/02/28/teaching.evolution.ap/index.html  (another matched page)

PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html      (layout pattern name)

SUB-0: CNN.com - Evolution debate continues - Feb 28, 2006                        (supplementary section)

TITLE: Evolution debate continues                                                 (article title)

SUB-17: Schools                                                                   (supplementary section)

SUB-17: Education

MAIN-20: SALT LAKE CITY (AP) -- House lawmakers scuttled a bill that would have   (main text)

         required public school students to be told that evolution is not

         empirically proven -- the latest setback for critics of evolution.


...
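
Once the text is in this form, turning it into ready-to-republish articles is trivial. The sketch below reads the format exactly as shown above and groups it into one record per matched page. It assumes the parenthetical notes on the right have been stripped, that indented lines continue the previous section, and that the only markers are the ones visible in the sample. The name `parse_articles` and the dictionary keys are my own, not anything Webstemmer provides.

```python
import re

# Matches "TITLE: ...", "PATTERN: ...", "MAIN-20: ...", "SUB-17: ...".
FIELD = re.compile(r"^(TITLE|PATTERN|MAIN|SUB)(?:-(\d+))?:\s*(.*)$")

def parse_articles(text):
    """Group the extracted text into one dict per !MATCHED page (illustrative)."""
    articles = []
    current = None   # record being filled, or None outside a matched page
    last = None      # "main" or "sub" if the previous line opened such a section
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue
        if line.startswith("!MATCHED:"):
            current = {"page": line.split(":", 1)[1].strip(), "pattern": None,
                       "title": None, "main": [], "sub": []}
            articles.append(current)
            last = None
        elif line.startswith("!UNMATCHED:"):
            current, last = None, None
        elif current is not None:
            m = FIELD.match(line)
            if m:
                kind, _section_id, value = m.groups()
                if kind in ("TITLE", "PATTERN"):
                    current[kind.lower()] = value
                    last = None
                else:
                    current[kind.lower()].append(value)
                    last = kind.lower()
            elif raw[:1].isspace() and last:
                # Indented line: continuation of the previous MAIN/SUB section.
                current[last][-1] += " " + line.strip()
    return articles

sample = """\
!UNMATCHED: 200511210103/www.cnn.com/privacy.html
!MATCHED: 200603010455/www.cnn.com/2006/HEALTH/02/09/billy.interview/index.html
PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html
TITLE: Too busy to cook? Not so fast
SUB-10: Leading chef shares his secrets for speedy, healthy cooking
MAIN-20: CNN's Jason White interviewed Chef Billy to learn his secrets for
         healthy cooking on the run.
"""

articles = parse_articles(sample)
print(articles[0]["title"])
print(articles[0]["main"])
```

Each record then carries a title plus a list of main-text paragraphs, which is everything a splog needs to post the article as its own.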

Comments

3 responses to “Splogs (or Snews) Using Web-Stemming”

  1. Asian

    Hi,
    Your blog has some pretty useful SEO info. But when you make up terms like “SNews” for splogs, you should check on Google whether anything else already goes by that name. SNews is a very popular open source CMS. It has nothing to do with splogs.
    Regards,
    Neo

  2. Mike

    Hey, “Asian”. This was meant to be a speculative post. And, yes I know I made up a term that means something else.

  3. Anna Smile

    Really cool blog you have. Let the Force be with you or, like my friend from Kazakhstan says, “great success”!
