Following along

http://use.perl.org/~miyagawa/journal/34461
Copying that once again, I made it possible to change the User-Agent and to specify a handler.

Here is an example that uses an HTTPHandler with debuglevel set to 1 to check
that the User-Agent change actually takes effect.

#!/usr/bin/env python2.5
#-*- coding: utf-8 -*-
from scraper import scraper, process

url = 'http://www.example.com/'

s = scraper(
    process('a', text='text', href='@href'),
)

import urllib2
http_handler = urllib2.HTTPHandler(debuglevel=1)
s.user_agent = 'Mozilla/5.0'
print s.scrape(url, http_handler)

$ ./example.py
connect: (www.example.com, 80)
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.example.com\r\nConnection: close\r\nUser-Agent: Mozilla/5.0\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sun, 16 Sep 2007 07:04:25 GMT
header: Server: Apache/2.2.3 (CentOS)
header: Last-Modified: Tue, 15 Nov 2005 13:24:10 GMT
header: ETag: "280100-1b6-80bfd280"
header: Accept-Ranges: bytes
header: Content-Length: 438
header: Connection: close
header: Content-Type: text/html; charset=UTF-8
{'text': 'RFC \r\n  2606', 'href': 'http://www.rfc-editor.org/rfc/rfc2606.txt'}
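
For reference, the same idea can be sketched with the standard library alone. This is not the actual scraper code, only a guess at the general mechanism: make_opener is a hypothetical helper name, and the sketch uses Python 3's urllib.request (the successor to urllib2) rather than the Python 2.5 used above.

```python
import urllib.request


def make_opener(user_agent, *handlers):
    """Hypothetical helper: build an opener with custom handlers and User-Agent.

    build_opener() drops a default handler when an instance of the same
    class is passed in, so a debug-enabled HTTPHandler replaces the
    default one instead of being added alongside it.
    """
    opener = urllib.request.build_opener(*handlers)
    # addheaders is a list of (name, value) pairs sent with every request;
    # overwriting it replaces the default "Python-urllib/x.y" User-Agent.
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


# debuglevel=1 makes http.client print the request and response headers.
http_handler = urllib.request.HTTPHandler(debuglevel=1)
opener = make_opener('Mozilla/5.0', http_handler)
```

Calling opener.open('http://www.example.com/') with this opener should print a connect/send/reply trace similar to the one above, with the User-Agent header set to Mozilla/5.0.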