
A Handy Tool for Python Network Requests: An In-Depth Look at the urllib Library

ds 2025/8/25 22:08:06

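1. Overview of urllib

urllib is Python's built-in library for making HTTP requests. It ships with the standard library, so nothing needs to be installed. It consists of four modules:

urllib.request: the core module for sending HTTP requests
urllib.error: the exceptions raised by urllib.request (HTTP errors such as 404, network failures, and so on)
urllib.parse: parsing and building URLs
urllib.robotparser: parsing a site's robots.txt file (used less often)

Compared with third-party libraries such as requests, urllib is lower-level, which suits cases that need fine-grained control over the request.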
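Two of these modules, urllib.parse and urllib.robotparser, can be tried without any network access. The sketch below is illustrative only: the example.com URLs and the robots.txt rules are made-up inputs, not taken from a real site.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urljoin
from urllib.robotparser import RobotFileParser

# Split a URL into its components.
parts = urlparse('https://example.com/search?q=python&page=2#results')
print(parts.scheme, parts.netloc, parts.path)  # https example.com /search
print(parse_qs(parts.query))                   # {'q': ['python'], 'page': ['2']}

# Build a query string from a dict (spaces become '+').
query = urlencode({'q': 'urllib tutorial', 'page': 1})
print(query)                                   # q=urllib+tutorial&page=1

# Resolve a relative link against the page it was found on.
absolute = urljoin('https://example.com/docs/index.html', '../about.html')
print(absolute)                                # https://example.com/about.html

# Feed robots.txt rules in as lines instead of downloading them.
rules = RobotFileParser()
rules.parse(['User-agent: *', 'Disallow: /private/'])
allowed = rules.can_fetch('*', 'https://example.com/index.html')
blocked = rules.can_fetch('*', 'https://example.com/private/data.html')
print(allowed, blocked)                        # True False
```

For a real site, RobotFileParser.set_url() followed by read() downloads the rules instead of parse().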

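2. Basic Usage: GET Requests

2.1 The Simplest Request

import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
print(response.read().decode('utf-8'))  # read the page content and decode it

urlopen() returns an HTTPResponse object that exposes the status code, the response headers, and other attributes.
read() returns the response body as bytes; call decode() to turn it into a string.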
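Hard-coding 'utf-8' fails on pages served in another encoding. The following is a minimal sketch, assuming the server declares its charset in the Content-Type header. The helper name fetch_text and its fallback_encoding parameter are hypothetical, not part of urllib.

```python
import urllib.request


def fetch_text(url, fallback_encoding='utf-8', timeout=10):
    """Fetch a URL and decode the body with the charset the server declares."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # the body always arrives as bytes
        # The headers object parses "Content-Type: text/html; charset=..." for us;
        # it returns None when no charset is declared.
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors='replace')


# Usage (needs network access):
# print(fetch_text('https://www.baidu.com')[:200])
```

The with block closes the connection even when decoding fails, which the bare urlopen() call above does not do.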

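2.2 Inspecting the Response Object

print(response.status)        # status code (200 means success)
print(response.getheaders())  # list of (name, value) response headers
print(response.getheader('Server'))  # value of one specific header

status and getheaders() give a quick way to diagnose how a request went.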

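3. Advanced Request Control

3.1 Adding Request Headers

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
req = urllib.request.Request(url='https://www.baidu.com', headers=headers)
response = urllib.request.urlopen(req)

The Request class builds more complex requests; sending a browser-like User-Agent imitates a browser and helps avoid anti-scraping blocks.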
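A Request object can be inspected before it is sent, which is handy for checking what headers will go out. This is a small sketch: the build_request helper and the header values are made up for illustration.

```python
import urllib.request


def build_request(url, extra_headers=None):
    """Create a Request with some browser-like default headers (illustrative values)."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (compatible; example-client/1.0)',
        'Accept-Language': 'en-US,en;q=0.8',
    }
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)


req = build_request('https://www.baidu.com', {'Referer': 'https://www.baidu.com/'})
req.add_header('Accept', 'text/html')  # headers can also be added afterwards

# Request stores header names via str.capitalize(), so 'User-Agent'
# has to be looked up as 'User-agent'.
print(req.get_header('User-agent'))
print(req.header_items())
print(req.get_method())  # GET, because no data was attached
```

Nothing is sent until the request is passed to urllib.request.urlopen(req).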

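3.2 POST Requests and Parameter Encoding

from urllib.parse import urlencode

url = 'https://example.com/submit'  # placeholder endpoint; replace with the real target
data = urlencode({'key1': 'value1', 'key2': 'value2'}).encode('utf-8')
req = urllib.request.Request(url, data=data, method='POST')
response = urllib.request.urlopen(req)

urlencode converts a dict into a URL-encoded string (here key1=value1&key2=value2).
Set method='POST' and pass the data as bytes. Supplying data already makes urllib send a POST, so method='POST' only makes the intent explicit.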
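Form encoding is not the only body format. As a hedged sketch, the two helpers below (form_request and json_request are hypothetical names, and the URL is a placeholder) build a form POST and a JSON POST so the difference can be checked without sending anything.

```python
import json
import urllib.request
from urllib.parse import urlencode


def form_request(url, fields):
    """Form-encoded POST: data is present, so the method defaults to POST."""
    data = urlencode(fields).encode('utf-8')
    return urllib.request.Request(url, data=data)


def json_request(url, payload):
    """JSON POST: the body is serialised JSON, so Content-Type must say so."""
    body = json.dumps(payload).encode('utf-8')
    return urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


form = form_request('https://example.com/submit', {'key1': 'value1', 'key2': 'value2'})
api = json_request('https://example.com/submit', {'name': 'urllib', 'tags': ['http', 'stdlib']})
print(form.get_method(), form.data)
print(api.get_method(), api.get_header('Content-type'))
```

When no Content-Type is set on a request that carries data, urllib adds application/x-www-form-urlencoded at send time, which is why the form variant needs no explicit header.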

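4. Exception Handling

4.1 Catching Basic Exceptions

from urllib.error import URLError, HTTPError

try:
    response = urllib.request.urlopen('http://invalid_url')
except HTTPError as e:
    print(f'HTTP error code: {e.code}')
except URLError as e:
    print(f'URL error: {e.reason}')

HTTPError is raised for 4xx/5xx status codes. It is a subclass of URLError, so its except clause has to come first.
URLError covers network-level failures such as DNS resolution errors or refused connections.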
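An HTTPError is more than an error code: it also behaves like a response, so the error body can still be read. The wrapper below is one possible way to turn both exception types into return values. The name fetch_status and the opener parameter (added so the function can be exercised without a network) are assumptions for this sketch.

```python
import urllib.request
from urllib.error import HTTPError, URLError


def fetch_status(url, opener=urllib.request.urlopen, timeout=5):
    """Return (status, body, error) instead of raising for HTTP and network errors."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status, response.read(), None
    except HTTPError as e:
        # The server answered, but with a 4xx/5xx status; the body is still readable.
        return e.code, e.read(), str(e.reason)
    except URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return None, None, str(e.reason)


# Usage (needs network access):
# status, body, error = fetch_status('https://www.baidu.com')
```

Returning a tuple keeps the calling code free of try/except blocks, at the cost of having to check the error field explicitly.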

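4.2 Timeout Control

try:
    response = urllib.request.urlopen(url, timeout=0.1)
except TimeoutError:
    print("Request timed out")  # timed out while waiting for the response
except URLError as e:
    # a timeout while connecting arrives wrapped in URLError
    if not isinstance(e.reason, TimeoutError):
        raise
    print("Request timed out")

The timeout parameter (in seconds) keeps a request from blocking for a long time. On Python versions before 3.10, catch socket.timeout instead of TimeoutError.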
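Timeouts are often transient, so a common follow-up is to retry with a growing delay. The sketch below is one way to do that, not a prescribed pattern: fetch_with_retry, is_timeout, and the opener and sleep parameters (which exist so the logic can be tested offline) are all hypothetical names.

```python
import socket
import time
import urllib.request
from urllib.error import URLError


def is_timeout(exc):
    """True for a bare socket timeout or one wrapped in URLError."""
    reason = getattr(exc, 'reason', None)
    return isinstance(exc, socket.timeout) or isinstance(reason, socket.timeout)


def fetch_with_retry(url, attempts=3, timeout=5, backoff=0.5,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Retry only on timeouts, doubling the delay after each failed attempt."""
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (URLError, socket.timeout) as exc:
            # Anything that is not a timeout, or the last attempt, is re-raised.
            if not is_timeout(exc) or attempt == attempts:
                raise
            sleep(backoff * 2 ** (attempt - 1))


# Usage (needs network access):
# body = fetch_with_retry('https://www.baidu.com', attempts=3, timeout=2)
```

With the defaults, the waits between attempts are 0.5 s and then 1.0 s; HTTP errors such as 404 are not retried because they are not timeouts.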

5. Advanced Use Cases

5.1 Downloading Files

urllib.request.urlretrieve('https://example.com/image.jpg', 'local_image.jpg')

urlretrieve() saves a network resource directly to a local file.
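The Python documentation describes urlretrieve() as a legacy interface that may be deprecated at some point. An alternative, sketched here under the assumption that a plain streamed copy is enough, combines urlopen() with shutil.copyfileobj(). The download function name is made up for this example.

```python
import shutil
import urllib.request


def download(url, dest_path, timeout=30):
    """Stream a resource to disk in chunks instead of loading it all into memory."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest_path, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    return dest_path


# Usage (needs network access):
# download('https://example.com/image.jpg', 'local_image.jpg')
```

Unlike urlretrieve(), this version accepts a timeout and works with a Request object, so custom headers can be sent along with the download.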

5.2 Proxy Settings

proxy_handler = urllib.request.ProxyHandler({'http': 'http://proxy.example.com:8080'})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)

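ProxyHandler routes requests through a proxy; install_opener() makes the opener the default for every later urlopen() call.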
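install_opener() changes behaviour globally. When only some requests should go through the proxy, the opener can be kept local and used directly. This is a sketch: make_proxy_opener is a hypothetical helper and proxy.example.com is a placeholder host.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that sends both http and https traffic through one proxy."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    # build_opener() swaps our handler in for the default ProxyHandler.
    return urllib.request.build_opener(handler)


opener = make_proxy_opener('http://proxy.example.com:8080')

# Usage (needs a reachable proxy):
# with opener.open('https://www.baidu.com', timeout=10) as response:
#     print(response.status)
# Plain urllib.request.urlopen() calls are unaffected by this opener.
```

Note that the original mapping only covers the 'http' scheme; https URLs need their own entry, as added here.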

6. In Practice: Building a Robust Crawler

from urllib.parse import urljoin

def robust_crawler(base_url):
    # extract_links() and log_error() are helpers defined elsewhere
    try:
        with urllib.request.urlopen(base_url, timeout=5) as response:
            if response.status == 200:
                html = response.read().decode('utf-8')
                # use the parse module to resolve relative paths
                links = [urljoin(base_url, link) for link in extract_links(html)]
                return links
    except Exception as e:
        log_error(e)
    return []
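The crawler calls extract_links() without showing it. One possible implementation, using only the standard library's html.parser, might look like the sketch below. It is an assumption about what the helper does (collect href values from a tags), not the author's version, and LinkCollector is a made-up class name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                self.links.append(value)


def extract_links(html):
    """Return the raw (possibly relative) links found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


sample = '<a href="/a">A</a><p><a href="b.html">B</a></p><a name="x">no href</a>'
base = 'https://example.com/dir/page.html'
print([urljoin(base, link) for link in extract_links(sample)])
# ['https://example.com/a', 'https://example.com/dir/b.html']
```

The raw links are resolved with urljoin() afterwards, exactly as the crawler does, so the helper itself stays independent of the page URL.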

The robust_crawler example includes: