2017-07-22

Fetching web pages with Python

Procedural approach

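# get_page.py
# In Python 3, urlopen lives in urllib.request (not directly in urllib)
from urllib.request import urlopen
url="https://chhy2009.github.io"
response=urlopen(url)
contents=response.read()  # raw bytes of the response body
text=contents.decode('utf8')  # decode the bytes into a str
print(text)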

Object-oriented approach

import requests
url="https://chhy2009.github.io"
response=requests.get(url)
print(response.status_code)
print(response.apparent_encoding)
response.encoding='utf-8'  # the encoding can be overridden like this
print(response.text)

The example above uses GET; requests also supports POST. Its API covers the get, post, put, delete, head, and options request types, among others.
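As a minimal sketch of how these request types map onto the library, the snippet below prepares one request per method without sending anything over the network. The helper name `build` is hypothetical and exists only for this illustration; `requests.Request(...).prepare()` is used here purely so the result can be inspected offline.

```python
import requests

URL = "https://chhy2009.github.io"  # same URL as in the examples above


def build(method, url=URL):
    """Prepare (but do not send) a request for the given HTTP method.

    `build` is a hypothetical helper for this sketch, not part of requests.
    """
    return requests.Request(method, url).prepare()


# One prepared request per request type mentioned in the text.
prepared = [build(m) for m in ("GET", "POST", "PUT", "DELETE", "HEAD", "OPTIONS")]
for p in prepared:
    print(p.method, p.url)
```

To actually send a request, each type has a matching function (`requests.get`, `requests.post`, `requests.put`, `requests.delete`, `requests.head`, `requests.options`), and `requests.request(method, url)` accepts the method as a string.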

Passing URL parameters
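To pass parameters, POST is the somewhat safer choice, because the values travel in the request body rather than in the URL. Usage is as follows (GET works the same way, except that the values are appended to the URL as a query string):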

payload = {'key1': 'value1', 'key2': 'value2'}
response=requests.post(url, payload)
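To make the difference between the two methods visible, here is a small sketch that builds a GET and a POST request with the same payload and inspects them without sending anything. The variable names `get_req` and `post_req` are illustrative; the keyword arguments `params` and `data` are the explicit forms of the positional argument used above.

```python
import requests

url = "https://chhy2009.github.io"
payload = {"key1": "value1", "key2": "value2"}

# Prepare the requests without sending them, to see where the parameters end up.
get_req = requests.Request("GET", url, params=payload).prepare()
post_req = requests.Request("POST", url, data=payload).prepare()

print(get_req.url)    # GET: parameters are appended to the URL as a query string
print(post_req.url)   # POST: the URL carries no parameters
print(post_req.body)  # POST: parameters are form-encoded in the request body
print(post_req.headers["Content-Type"])
```

Because GET parameters become part of the URL, they can show up in browser history and server logs, which is the main reason POST is considered the safer option for sensitive values.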

Timeouts
Both get and post are blocking calls. To guard against a server that does not respond in time, pass a timeout parameter (in seconds) with the request.
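A minimal sketch of a timeout-aware fetch might look like the following. The function name `fetch` and the default timeout values are assumptions chosen for illustration; the tuple form sets the connect and read timeouts separately, and `requests.exceptions.Timeout` covers both cases.

```python
import requests


def fetch(url, timeout=(3.05, 10)):
    """Return the page text, or None if the server does not respond in time.

    `fetch` is a hypothetical helper; `timeout` is (connect seconds, read seconds)
    and the default values here are illustrative, not recommendations.
    """
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Raised for both connect timeouts and read timeouts.
        return None
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

Calling `fetch("https://chhy2009.github.io", timeout=5)` would apply a single 5-second limit to both phases. Returning None on timeout is one possible design; re-raising or retrying would be equally reasonable depending on the caller.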
Requests library link: Requests: HTTP for Humans