Most of us are familiar with the current trend of splogs (spam blogs) stuffed with three AdSense units above the fold.
These pages use the blackhat SEO technique of scraping RSS feeds from multiple blogs. The scraped content then shows up in Google's results (if the SEO-er is innovative enough).
If you use partial feeds, you’re safe, right?
Here’s Webstemmer, a Python tool that takes it all to a new level.
“Snews” — scraping the news sites
Here’s their claimed accuracy:
New York Times 488.8/552.2 (88%)
Newsday 373.7/454.7 (82%)
Washington Post 342.6/367.3 (93%)
Boston Globe 332.9/354.9 (93%)
ABC News 299.7/344.4 (87%)
BBC 283.3/337.4 (84%)
Los Angeles Times 263.2/345.5 (76%)
Reuters 188.2/206.9 (91%)
CBS News 171.8/190.1 (90%)
Seattle Times 164.4/185.4 (89%)
NY Daily News 144.3/147.4 (98%)
International Herald Tribune 125.5/126.5 (99%)
Channel News Asia 119.5/126.2 (94%)
CNN 65.3/73.9 (89%)
Voice of America 58.3/62.6 (94%)
Independent 58.1/58.5 (99%)
Financial Times 55.7/56.6 (98%)
USA Today 44.5/46.7 (96%)
NY1 35.7/37.1 (95%)
1010 Wins 14.3/16.1 (88%)
Total 3829.1/4349.2 (88%)
It’s fairly accurate, at 88% overall, and that is while scraping professional news sources. If you read a lot of news online, you know how fragmented the text is, with articles broken up by random ads. Now, how hard would it be to scrape WordPress blogs that EACH have the SAME EXACT template structure? Not very.
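To make the figures above a little more concrete, here is a minimal sketch that recomputes a few of the percentages. It assumes each percentage is simply the first number divided by the second; the post does not say how the figures were actually derived, and the helper name `ratio_percent` is made up for this example. Under that assumption the plain ratios land within about a point of the quoted percentages, so the rounding used in the original table may differ.

```python
# Figures copied from the accuracy list: (first number, second number, quoted %).
# Assumption: the quoted percentage is roughly first / second.
quoted = {
    "New York Times": (488.8, 552.2, 88),
    "CNN": (65.3, 73.9, 89),
    "Total": (3829.1, 4349.2, 88),
}


def ratio_percent(extracted, total):
    """Return extracted / total as a percentage (hypothetical helper)."""
    return 100.0 * extracted / total


for site, (extracted, total, stated) in quoted.items():
    computed = ratio_percent(extracted, total)
    # Show the recomputed ratio next to the percentage quoted in the post.
    print(f"{site}: computed {computed:.1f}%, quoted {stated}%")
```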
Below is how text is broken up:
$ cat cnn.txt
!UNMATCHED: 200511210103/www.cnn.com/  (unmatched page)
!UNMATCHED: 200511210103/www.cnn.com/privacy.html  (unmatched page)
!UNMATCHED: 200511210103/www.cnn.com/interactive_legal.html  (unmatched page)
...
!MATCHED: 200603010455/www.cnn.com/2006/HEALTH/02/09/billy.interview/index.html  (matched page)
PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html  (layout pattern name)
SUB-0: CNN.com - Too busy to cook? Not so fast - Feb 9, 2006  (supplementary section)
TITLE: Too busy to cook? Not so fast  (article title)
SUB-10: Leading chef shares his secrets for speedy, healthy cooking  (supplementary section)
SUB-17: Corporate Governance  (supplementary section)
SUB-17: Lifestyle (House and Home)
SUB-17: New You Resolution
SUB-17: Billy Strynkowski
MAIN-20: (CNN) -- A busy life can put the squeeze on healthy eating. But that  (main text)
         doesn't have to be the case, according to Billy Strynkowski, executive chef
         of Cooking Light magazine. He says cooking healthy, tasty meals at home can be
         done in 20 minutes or less.
MAIN-20: CNN's Jason White interviewed Chef Billy to learn his secrets for healthy
         cooking on the run.
...
SUB-25: Health care difficulties in the Big Easy  (supplementary section)
!MATCHED: 200603010455/www.cnn.com/2006/EDUCATION/02/28/teaching.evolution.ap/index.html  (another matched page)
PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html  (layout pattern name)
SUB-0: CNN.com - Evolution debate continues - Feb 28, 2006  (supplementary section)
TITLE: Evolution debate continues  (article title)
SUB-17: Schools  (supplementary section)
SUB-17: Education
MAIN-20: SALT LAKE CITY (AP) -- House lawmakers scuttled a bill that would have  (main text)
         required public school students to be told that evolution is not empirically
         proven -- the latest setback for critics of evolution.
...
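To show how little work is left once the text is tagged like this, here is a rough sketch that groups the tagged lines back into one record per matched page. It is not part of Webstemmer: the function `parse_extracted` and the dictionary layout are made up for illustration. It also assumes that the parenthesized labels in the listing (such as "(main text)") are explanatory notes rather than part of the file, and that an untagged line continues the previous MAIN or SUB field.

```python
import re

# A tag at the start of a line: "!MATCHED", "TITLE", "MAIN-20", "SUB-17", ...
TAG_RE = re.compile(r"^(!MATCHED|!UNMATCHED|PATTERN|TITLE|MAIN-\d+|SUB-\d+):\s*(.*)$")


def parse_extracted(text):
    """Group tagged lines into one dict per matched page (hypothetical helper)."""
    pages = []
    current = None  # record for the matched page being read, if any
    last = None     # list holding the most recent MAIN/SUB text, for wrapped lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "...":
            continue
        m = TAG_RE.match(line)
        if m is None:
            # Untagged line: assume it continues the previous MAIN/SUB field.
            if last is not None:
                last[-1] += " " + line
            continue
        tag, value = m.groups()
        if tag == "!UNMATCHED":
            current, last = None, None  # nothing was extracted for this page
        elif tag == "!MATCHED":
            current = {"page": value, "pattern": None, "title": None,
                       "main": [], "sub": []}
            pages.append(current)
            last = None
        elif current is None:
            continue  # tagged line outside any matched page
        elif tag == "PATTERN":
            current["pattern"], last = value, None
        elif tag == "TITLE":
            current["title"], last = value, None
        elif tag.startswith("MAIN-"):
            current["main"].append(value)
            last = current["main"]
        else:  # SUB-n
            current["sub"].append(value)
            last = current["sub"]
    return pages


# A shortened excerpt of the listing, with the explanatory labels left out.
sample = """\
!UNMATCHED: 200511210103/www.cnn.com/privacy.html
!MATCHED: 200603010455/www.cnn.com/2006/HEALTH/02/09/billy.interview/index.html
PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html
SUB-0: CNN.com - Too busy to cook? Not so fast - Feb 9, 2006
TITLE: Too busy to cook? Not so fast
MAIN-20: (CNN) -- A busy life can put the squeeze on
healthy eating.
"""

for page in parse_extracted(sample):
    print(page["title"])
    print(" ".join(page["main"]))
```

The title and main text come out cleanly separated from the navigation and sidebar fragments, which is the whole point: anyone republishing content only needs the TITLE and MAIN fields.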