Scraping Yahoo! Search with Web::Scraper in Python

Next, I'll try doing the same thing as http://menno.b10m.net/blog/blosxom/perl/scraping-yahoo-search-with-web-scraper.html.

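I've put the Perl code alongside for easy comparison.
The amount of code and the way it looks are not very different.
The Perl version is tidy because it has no parentheses; the Python version is tidy because it has no semicolons.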

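Until now the Python version did not support the "spam[]" notation,
so I made it loop when the keyword argument has the form 'spam__list'.
I like the look of "[]" better, but the characters allowed in a Python keyword argument name are limited, so I'll put up with something a little uglier. (Django's lookup_type uses the same double-underscore style anyway.)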
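
The suffix handling described above can be sketched roughly as follows. This is not the code of the scraper module used in this post: `split_key` and `collect` are hypothetical names, the matched nodes are plain dicts standing in for parsed HTML elements, and the sketch is written for current Python 3 rather than Python 2.5. It only illustrates the idea of looping when the keyword ends in `__list`.

```python
LIST_SUFFIX = "__list"  # stands in for Perl's trailing "[]"


def split_key(name):
    """Return (key, is_list) for a keyword-argument name."""
    if name.endswith(LIST_SUFFIX):
        return name[:-len(LIST_SUFFIX)], True
    return name, False


def collect(matches, **rules):
    """Apply each extraction rule to the matched nodes.

    matches: list of matched nodes (plain dicts in this sketch)
    rules:   key=callable or key__list=callable
    """
    result = {}
    for name, extract in rules.items():
        key, is_list = split_key(name)
        if is_list:
            # 'key__list': loop over every match and keep all values
            result[key] = [extract(node) for node in matches]
        elif matches:
            # plain 'key': keep the value from the first match only
            result[key] = extract(matches[0])
    return result


nodes = [{"text": "Perl.com"}, {"text": "Perl Mongers"}]
print(collect(nodes, titles__list=lambda n: n["text"]))
print(collect(nodes, title=lambda n: n["text"]))
```

In this sketch, `titles__list=...` yields one value per matched node under the key `titles`, while a plain `title=...` keeps only the first match.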

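The Python output is nested one level deeper: the Perl version returns the list of results directly, while the Python version returns a dict with the list under the 'results' key.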

use Data::Dumper;
use URI;
use Web::Scraper;

my $yahoo = scraper {
   process "/html/body/div[5]/div/div/div[2]/ol/li", 'results[]' => scraper {
      process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
      process "div.yschabstr", 'description' => "TEXT";

      result 'description', 'title', 'url';
   };
   result 'results';
};

print Dumper $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );

$ time ./ysearch.pl | head
$VAR1 = [
          {
            'url' => 'http://www.perl.com/',
            'title' => 'Perl.com',
            'description' => 'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.'
          },
          {
            'url' => 'http://www.perl.org/',
            'title' => 'Perl Mongers',
            'description' => 'Nonprofit organization, established to support the Perl community.'

real	0m1.391s
user	0m0.282s
sys	0m0.038s

#!/usr/bin/env python2.5
#-*- coding: utf-8 -*-
from scraper import scraper, process
import codecs, sys
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

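# Each <li> matched by the XPath becomes one entry of the 'results' list;
# the '__list' suffix plays the role of Perl's 'results[]'.
yahoo = scraper(
    process('/html/body/div[5]/div/div/div[2]/ol/li', results__list=scraper(
        process('a.yschttl', title='TEXT', url='@href'),
        process('div.yschabstr', description="TEXT"))
    )
)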

from pprint import pprint
pprint(yahoo.scrape('http://search.yahoo.com/search?p=Perl'))

$ time ./ysearch.py |head 
{'results': [{'description': 'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.',
              'title': 'Perl.com',
              'url': 'http://www.perl.com/'},
             {'description': 'Nonprofit organization, established to support the Perl community.',
              'title': 'Perl Mongers',
              'url': 'http://www.perl.org/'},
             {'description': 'Instructions on downloading a Perl interpreter for your computer platform. ... On CPAN, you will find Perl source in the /src directory. ...',
              'title': 'Getting Perl',
              'url': 'http://www.perl.com/download.csp'},
             {'description': 'Perl borrows features from a variety of other languages including C, shell ... Perl 3, released in 1989, added support for binary data streams. ...',

real	0m1.500s
user	0m0.112s
sys	0m0.268s
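
To get the same shape as the Perl output, the list can be pulled out of the returned dict by its key. Here is a small sketch that uses a hand-copied excerpt of the Python output above in place of a live `yahoo.scrape(...)` call; the variable names are only illustrative.

```python
# Excerpt of the Python output shown above, copied by hand so the
# example runs without network access or the scraper module.
scraped = {
    'results': [
        {'description': 'Nonprofit organization, established to support the Perl community.',
         'title': 'Perl Mongers',
         'url': 'http://www.perl.org/'},
    ]
}

# The Perl script's `result 'results'` hands back the list itself;
# indexing by the key removes the extra level of nesting in Python.
results = scraped['results']

for entry in results:
    print(entry['title'], entry['url'])
```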