Crawler request modules (synchronous and asynchronous)
The requests module
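- requests supports HTTP keep-alive and connection pooling, cookie-based session persistence, file uploads, automatic decoding of response content, and automatic encoding of internationalized URLs and POST data.
- It is a high-level wrapper around Python's built-in modules that makes network requests much more ergonomic; with Requests you can easily issue nearly any request a browser can.
- Within a Session, requests keeps connections alive automatically (keep-alive).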
Simple requests
The calls below are all convenience shortcuts: each one creates a temporary Session object and calls its request method.
requests.get(url) # GET request
requests.post(url) # POST request
requests.put(url) # PUT request
requests.delete(url) # DELETE request
requests.head(url) # HEAD request
requests.options(url) # OPTIONS request
Parameters
def request(method, url, **kwargs):
"""Constructs and sends a :class:`Request <Request>`.
1. Request method
:param method: method for the new :class:`Request` object: ``GET``, ``OPTIONS``, ``HEAD``, ``POST``, ``PUT``, ``PATCH``, or ``DELETE``.
2. Request URL
:param url: URL for the new :class:`Request` object.
3. Query-string parameters
:param params: (optional) Dictionary, list of tuples or bytes to send
in the query string for the :class:`Request`.
4. Form parameters
:param data: (optional) Dictionary, list of tuples, bytes, or file-like
object to send in the body of the :class:`Request`.
5. JSON parameters
:param json: (optional) A JSON serializable Python object to send in the body of the :class:`Request`.
6. Request headers
:param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
7. Request cookies
:param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
8. File upload
:param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
to add for the file.
9. auth
:param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
10. Timeout
:param timeout: (optional) How many seconds to wait for the server to send data
before giving up, as a float, or a :ref:`(connect timeout, read
timeout) <timeouts>` tuple.
:type timeout: float or tuple
11. Whether to follow redirects
:param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
:type allow_redirects: bool
12. proxies
:param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
proxies = {'http':'ip1:port1','https':'ip2:port2' }
If that raises an error, include the scheme in each proxy URL:
proxies = {'http':'http://ip1:port1','https':'https://ip2:port2' }
requests.get('url',proxies=proxies)
13. HTTPS certificate verification
:param verify: (optional) Either a boolean, in which case it controls whether we verify
the server's TLS certificate, or a string, in which case it must be a path
to a CA bundle to use. Defaults to ``True``.
14. stream
:param stream: (optional) if ``False``, the response content will be immediately downloaded.
15. Client certificate
:param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
:return: :class:`Response <Response>` object
:rtype: requests.Response
"""
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)
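To make the difference between params, data, and json concrete, here is a minimal sketch that prepares requests without sending them, so it needs no network access. The `build` helper and the example.com URLs are illustrative assumptions, not part of the original notes.

```python
import json

import requests


def build(method, url, **kwargs):
    """Prepare a request without sending it, so its encoding can be inspected."""
    return requests.Request(method, url, **kwargs).prepare()


# params -> appended to the URL as a query string
query = build("GET", "https://example.com/search", params={"q": "python", "page": 2})
# data -> form-encoded request body
form = build("POST", "https://example.com/login", data={"user": "alice", "pwd": "secret"})
# json -> JSON-serialized request body
payload = build("POST", "https://example.com/api", json={"user": "alice"})

query_url = query.url
form_type = form.headers["Content-Type"]
form_body = form.body
json_type = payload.headers["Content-Type"]
json_body = json.loads(payload.body)  # body may be str or bytes depending on version

print(query_url)
print(form_type, form_body)
print(json_type, json_body)
```

Preparing instead of sending is only a convenience for inspection; in a crawler the same keyword arguments would be passed straight to requests.get or requests.post.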
cookieJar
A response object returned by requests has a cookies attribute. Its value is a CookieJar-type object containing the cookies the remote server set locally.
# Convert the cookies to a dict
cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)
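The conversion can be tried offline with a hand-built jar. In this sketch the cookie names, values, and domain are made up for illustration; in real use the jar would come from response.cookies.

```python
import requests
from requests.cookies import RequestsCookieJar

# Hypothetical cookies standing in for what a server would set
jar = RequestsCookieJar()
jar.set("sessionid", "abc123", domain="example.com", path="/")
jar.set("theme", "dark", domain="example.com", path="/")

# CookieJar -> plain dict (domain and path information is dropped)
cookies_dict = requests.utils.dict_from_cookiejar(jar)

# dict -> CookieJar, e.g. to reload cookies saved from an earlier run
restored = requests.utils.cookiejar_from_dict(cookies_dict)

print(cookies_dict)
print(restored.get("theme"))
```

Note that the dict form is lossy: if two cookies share a name but differ by domain or path, only one of them survives the conversion.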
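Types of proxy IPs (proxy servers)
By degree of anonymity, proxy IPs fall into three categories:
Transparent proxy: a transparent proxy nominally "hides" your IP address, but the target server can still find out who you are, because your real IP is forwarded. The target server sees the following request headers: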
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP
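Anonymous proxy: the target server can tell that you are using a proxy, but not who you are. The target server sees the following request headers:
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Proxy IP
Elite proxy (high-anonymity proxy): the target server cannot tell that you are using a proxy at all, which makes this the best choice. The target server sees the following request headers: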
REMOTE_ADDR = Proxy IP
HTTP_VIA = not determined
HTTP_X_FORWARDED_FOR = not determined
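The three header patterns above can be turned into a small check for testing a proxy you have obtained. This is a rough sketch: it assumes you have already fetched, through the proxy, the headers as the target server saw them (for example from a header-echo endpoint you control) and that you know your own real IP. The function name and the sample addresses are hypothetical.

```python
def classify_proxy(seen_headers, real_ip):
    """Classify a proxy from the headers the target server received."""
    via = seen_headers.get("HTTP_VIA")
    forwarded = seen_headers.get("HTTP_X_FORWARDED_FOR")
    if via is None and forwarded is None:
        # No trace of a proxy at all
        return "elite"
    if forwarded is not None and real_ip in forwarded:
        # The real client IP leaked through
        return "transparent"
    # A proxy is visible, but the real IP is not
    return "anonymous"


real_ip = "203.0.113.7"   # documentation-range address, stands in for your IP
proxy_ip = "198.51.100.9"  # documentation-range address, stands in for the proxy

transparent = classify_proxy(
    {"REMOTE_ADDR": proxy_ip, "HTTP_VIA": proxy_ip, "HTTP_X_FORWARDED_FOR": real_ip}, real_ip)
anonymous = classify_proxy(
    {"REMOTE_ADDR": proxy_ip, "HTTP_VIA": proxy_ip, "HTTP_X_FORWARDED_FOR": proxy_ip}, real_ip)
elite = classify_proxy({"REMOTE_ADDR": proxy_ip}, real_ip)

print(transparent, anonymous, elite)
```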
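Choose a proxy service that matches the protocol the target site uses. By the protocol of the proxied request, proxies can be divided into:
- http proxy: the target URL uses the http protocol
- https proxy: the target URL uses the https protocol
- socks tunnel proxy (for example, a socks5 proxy):
  - A socks proxy simply relays packets and does not care which application protocol is in use (FTP, HTTP, HTTPS, and so on).
  - A socks proxy usually adds less overhead than an http or https proxy.
  - A socks proxy can forward both http and https requests.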
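Because one SOCKS proxy can carry both http and https traffic, the proxies dict for requests can point both keys at the same proxy URL. The helper below is a sketch under assumptions: `build_proxies` is a hypothetical name, the address is a placeholder, and the set of accepted schemes is just what this sketch chooses to allow. Using a socks5:// URL with requests also requires the optional SOCKS dependency (installed with `pip install requests[socks]`).

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https", "socks5", "socks5h"}


def build_proxies(proxy_url):
    """Return a requests-style proxies dict routing http and https targets through one proxy."""
    parts = urlsplit(proxy_url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"missing or unsupported proxy scheme: {proxy_url!r}")
    if not parts.hostname or parts.port is None:
        raise ValueError(f"proxy URL needs a host and a port: {proxy_url!r}")
    # Keys are the protocol of the TARGET url; values are the proxy to use for it
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("socks5://127.0.0.1:1080")  # placeholder address
print(proxies)
# Then: requests.get(url, proxies=proxies)
```

Rejecting scheme-less values such as 'ip:port' up front gives a clearer failure than the error raised later when the request is actually sent.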
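Session objects
A session persists certain parameters (such as cookies) across requests, which suits flows that first fetch a token and then log in.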
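with requests.session() as session:
    # A token the server sets as a cookie is stored on the session automatically
    response1 = session.get()
    # Later requests in the same session carry the token automatically
    response2 = session.post()
    ...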
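What "carried automatically" means can be checked without touching the network by preparing a request through the session and inspecting the result. In this sketch the header name, token value, cookie, and URL are all invented for illustration.

```python
import requests

with requests.Session() as session:
    # State attached to the session (hypothetical values)
    session.headers.update({"X-Token": "demo-token"})
    session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

    # prepare_request merges session-level headers and cookies into the request
    prepared = session.prepare_request(
        requests.Request("GET", "https://example.com/profile")
    )
    token_header = prepared.headers["X-Token"]
    cookie_header = prepared.headers["Cookie"]

print(token_header)
print(cookie_header)
```

In a real login flow the cookie would be set by the server's response to the first request rather than by hand; the merge step is the same.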
aiohttp
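requests is powerful, but it is a synchronous library; for asynchronous HTTP there is the dedicated aiohttp module.
For a brief introduction to async, see my earlier post: async
Usage
The remaining parameters differ little from those of requests.session()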
import aiohttp
import asyncio
async def request():
    # Equivalent to requests.session()
    async with aiohttp.ClientSession() as session:
        # Asynchronous request
        async with session.get("https://example.com/imgs/20230116.jpg") as result:
            # Read the raw bytes, like requests.Response.content
            content = await result.read()
            # For a JSON body, use this instead (like requests.Response.json()):
            # data = await result.json()
            # For a text body, use this instead (like requests.Response.text):
            # text = await result.text()
    # Blocking code
    with open("20230116.jpg", "wb") as writer:
        writer.write(content)
async def main():
    tasks = [
        asyncio.create_task(request())
    ]
    await asyncio.wait(tasks)
if __name__ == '__main__':
    # asyncio.run is the officially recommended way to start the event loop
    asyncio.run(main())
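The example above schedules a single task; the benefit of async shows up when many requests run concurrently. The sketch below uses only the standard library and simulates network latency with asyncio.sleep, so `fake_fetch` is a stand-in (an assumption, not an aiohttp call) for the real `session.get` logic. A semaphore caps how many requests are in flight at once, which is a common courtesy toward the target site.

```python
import asyncio


async def fake_fetch(url, delay=0.05):
    """Stand-in for an HTTP request: waits, then returns a fake body."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl(urls, limit=2):
    """Fetch all urls concurrently, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(url) for url in urls))


urls = [f"https://example.com/page/{i}" for i in range(4)]
pages = asyncio.run(crawl(urls))
print(pages)
```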
aiofiles
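The file write above is still blocking; aiofiles is a dedicated module for asynchronous file operations.
import aiofiles
# Non-blocking code (inside a coroutine)
async with aiofiles.open("20230116.jpg", "wb") as writer:
    await writer.write(content)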
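If adding a dependency is not desirable, a different technique with a similar effect is to push the blocking write onto a worker thread with the standard library's asyncio.to_thread (Python 3.9+). This is an alternative to aiofiles, not how aiofiles is used; the helper names and the fake payload are illustrative assumptions.

```python
import asyncio
import os
import tempfile


def write_bytes(path, data):
    """Ordinary blocking write."""
    with open(path, "wb") as writer:
        writer.write(data)


async def save(path, data):
    # Run the blocking write in a worker thread so the event loop stays free
    await asyncio.to_thread(write_bytes, path, data)


async def main():
    path = os.path.join(tempfile.mkdtemp(), "20230116.jpg")
    await save(path, b"fake image bytes")  # placeholder payload
    return path


saved_path = asyncio.run(main())
print(saved_path)
```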
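Notes on downloading videos:
- Small videos: the page references the file directly
<video src="video.mp4"/>
(applies to some old-fashioned small platforms, such as pirate sites)
- Large videos: find the m3u8 file, download the ts files, and merge them into an mp4 file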
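The first step of the large-video route, extracting the ts segment URLs from an m3u8 playlist, can be sketched with the standard library alone. The playlist text and URL below are made-up samples, and `parse_m3u8` is a hypothetical helper. The sketch assumes a plain media playlist: master playlists that point at other m3u8 files, and encrypted streams (marked by #EXT-X-KEY), need extra handling.

```python
from urllib.parse import urljoin


def parse_m3u8(playlist_text, playlist_url):
    """Return absolute URLs of the segments listed in a media playlist."""
    segments = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments; the rest are segment URIs
        if not line or line.startswith("#"):
            continue
        # Segment URIs may be relative to the playlist location
        segments.append(urljoin(playlist_url, line))
    return segments


sample_playlist = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg0.ts
#EXTINF:10.0,
seg1.ts
#EXT-X-ENDLIST
"""

segment_urls = parse_m3u8(sample_playlist, "https://example.com/video/index.m3u8")
print(segment_urls)
```

Each URL can then be downloaded with the request code shown earlier. Turning the downloaded segments into an mp4 is typically done with an external tool such as ffmpeg.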