From Single Machine to Distributed: The Evolution of Python Crawler Architecture

news 2025/8/31 5:07:14

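Chapter 1: Single-Machine Crawlers, Starting Point and Limitations

1. Goals and Audience

2. Environment Setup

3. Minimum Viable Crawler (MVP)

4. Hardening: Retries, Timeouts, Random User-Agents, Polite Crawling

5. Three Ways to Parse: CSS / XPath / Regex

6. Persistence: CSV / SQLite / MongoDB (Example: CSV)

7. Single-Machine Concurrency Basics: ThreadPoolExecutor + Rate Limiting

8. Compliance and Risk-Control Checklist (Habits to Build at the Single-Machine Stage)

9. Summary

10. Exercises and Discussion Questions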

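Chapter 2: Framework-Based Crawlers, Raising Engineering Quality with Scrapy

1. Why Use a Framework?

2. Scrapy's Core Architecture

3. Getting Started with Scrapy

4. Writing Your First Spider

5. Item Pipeline: Data Cleaning and Storage

6. Middleware: Enhancing Requests

7. Strengths and Weaknesses of Scrapy

8. Summary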

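Chapter 3: Async and High Concurrency, Breaking the I/O Bottleneck

1. The Python Async Ecosystem

2. Why Async Suits Crawlers

3. An aiohttp Crawler Example

4. Pros and Cons of Async Crawlers

5. Use Cases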

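Chapter 4: Distributed Crawlers, the Leap from Single Machine to Cluster

1. Core Challenges of Distributed Crawlers

2. Message-Queue-Based Distributed Designs

3. Scrapy-Redis: Making Scrapy Distributed in an Engineered Way

4. Funboost: General-Purpose Distributed Task Scheduling

5. Storage and Scaling for Distributed Crawlers

6. Use Cases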

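Chapter 5: Anti-Scraping Countermeasures and Intelligent Automation, the Evolving Contest Between Offense and Defense

1. Common Anti-Scraping Techniques

2. Common Countermeasures

3. Trends in Intelligence and Automation

4. Example: Python Logic for Decoding Font-Based Anti-Scraping

5. Use Cases and Future Trends

Conclusion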

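Chapter 1: Single-Machine Crawlers, Starting Point and Limitations

1. Goals and Audience

Goal: write a stable, maintainable single-machine crawler and lay the groundwork for engineering discipline (logging, retries, rate limiting, persistence).
Audience: readers who already know basic Python syntax and want to turn a "script" into a "dependable tool".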

2. Environment Setup

Python ≥ 3.9
Recommended libraries: requests, beautifulsoup4, lxml, tenacity (retries), loguru (logging, optional)

pip install requests beautifulsoup4 lxml tenacity loguru

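3. Minimum Viable Crawler (MVP)

import requests
from bs4 import BeautifulSoup

# Without a timeout, a stalled server can hang the script indefinitely
resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(resp.text, "lxml")
print(soup.title.get_text(strip=True))

Key points: always set a timeout; call raise_for_status() so that HTTP errors surface as exceptions instead of passing silently.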

4. Hardening: Retries, Timeouts, Random User-Agents, Polite Crawling

import random, time
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

UAS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/118 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Version/15.5 Safari/605.1.15",
]

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=8))
def fetch(url: str) -> str:
    headers = {"User-Agent": random.choice(UAS)}
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    # Politeness: crude rate limiting so we don't hammer the site
    time.sleep(random.uniform(0.5, 1.5))
    return r.text

Key points: exponential backoff (wait_exponential) is gentler on the server when errors are transient (429/5xx); add a random delay and a random User-Agent. Note that @retry as configured here retries on any exception, including permanent errors such as 404.
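One way to restrict retries to transient failures is sketched below using only the standard library. This is an illustration of the backoff idea, not a reproduction of tenacity's internals: `HttpStatusError`, `RETRYABLE_STATUS`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the set of status codes treated as transient is an assumption you should adapt to the target site.

```python
import time

# Assumption: which status codes count as transient is a policy choice.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class HttpStatusError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 8.0) -> list:
    """Delays slept between attempts: base * 2**i, capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts - 1)]


def call_with_retry(func, attempts: int = 3, base: float = 1.0,
                    cap: float = 8.0, sleep=time.sleep):
    """Call func(); retry only transient HTTP errors, backing off each time."""
    delays = backoff_delays(attempts, base, cap)
    for i in range(attempts):
        try:
            return func()
        except HttpStatusError as exc:
            last_attempt = i == attempts - 1
            if exc.status not in RETRYABLE_STATUS or last_attempt:
                raise  # permanent error, or no attempts left
            sleep(delays[i])
```

The `sleep` parameter is injectable so the schedule can be checked without actually waiting. With the defaults, a function that keeps failing with 503 is tried three times with pauses of 1 and 2 seconds, while a 404 is raised immediately.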

5. Three Ways to Parse: CSS / XPath / Regex

from bs4 import BeautifulSoup
from lxml import etree
import re

html = fetch("https://example.com")

# 1) CSS (BS4)
soup = BeautifulSoup(html, "lxml")
title_css = soup.select_one("title").get_text(strip=True)

# 2) XPath (lxml)
dom = etree.HTML(html)
title_xpath = dom.xpath("string(//title)")

# 3) Regex (a fallback, not a first choice)
match = re.search(r"<title>(.*?)</title>", html, flags=re.I | re.S)
title_re = match.group(1).strip() if match else None

print(title_css, title_xpath, title_re)

Recommendation: prefer CSS/XPath; use regex only as a fallback or for extracting small fragments.
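To see why regex is the weaker choice, consider a title tag that carries an attribute. The sketch below compares the article's regex with a real HTML parser. To keep it dependency-free it uses the standard library's `html.parser` as a stand-in for BS4/lxml; `TitleParser`, `title_by_regex`, and `title_by_parser` are hypothetical helper names.

```python
import re
from html.parser import HTMLParser

# Same pattern as the regex approach above: it expects a bare <title> tag.
TITLE_RE = re.compile(r"<title>(.*?)</title>", flags=re.I | re.S)


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def title_by_regex(html: str):
    m = TITLE_RE.search(html)
    return m.group(1).strip() if m else None


def title_by_parser(html: str):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

For `<title lang="en">Example Domain</title>`, `title_by_regex` finds nothing because the pattern does not allow attributes, while `title_by_parser` still returns the text. A parser tolerates markup variations that a hand-written pattern has to anticipate one by one.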

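6. Persistence: CSV / SQLite / MongoDB (Example: CSV)

import csv, pathlib
from datetime import datetime, timezone

OUTPUT = pathlib.Path("data.csv")

def save_csv(rows):
    # Check before opening: mode "a" creates the file if it is missing
    exists = OUTPUT.exists()
    with OUTPUT.open("a", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["url", "title", "ts"])
        if not exists:
            w.writeheader()
        for r in rows:
            w.writerow(r)

rows = [{
    "url": "https://example.com",
    "title": title_css,
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
    "ts": datetime.now(timezone.utc).isoformat(),
}]
save_csv(rows)

Key points: keep field names consistent; store timestamps in UTC; append to the file and write the header automatically on first use.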
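The section title also lists SQLite. As a rough sketch of what that variant might look like with the standard library's `sqlite3` module, the code below stores the same three fields and keys rows by URL, so re-crawling a page overwrites the old row instead of appending a duplicate. The table name `pages`, the file name `data.db`, and the helpers `open_db` and `save_rows` are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def open_db(path: str = "data.db") -> sqlite3.Connection:
    """Open the database and create the table on first use."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pages (
            url   TEXT PRIMARY KEY,
            title TEXT,
            ts    TEXT NOT NULL
        )
        """
    )
    return conn


def save_rows(conn: sqlite3.Connection, rows) -> None:
    """Insert rows shaped like {"url": ..., "title": ..., "ts": ...}."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO pages (url, title, ts) "
            "VALUES (:url, :title, :ts)",
            rows,
        )
```

Compared with the CSV version, the primary key gives deduplication for free, and the table can be queried with SQL once the data grows beyond what is comfortable to inspect in a spreadsheet.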

7. Single-Machine Concurrency Basics: `ThreadPoolExecutor` + Rate Limiting

from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from urllib.parse import urljoin

BASE = "https://example.com/"
PATHS = ["/", "#", "/?page=2", "/about"]  # sample paths
URLS = [urljoin(BASE, p) for p in PATHS]

def parse_title(url: str) -> dict:
    html = fetch(url)
    soup = BeautifulSoup(html, "lxml")
    return {"url": url, "title": soup.title.get_text(strip=True)}

results = []
with ThreadPoolExecutor(max_workers=8) as pool:
    futs = [pool.submit(parse_title, u) for u in URLS]
    for fut in as_completed(futs):
        try:
            data = fut.result()
            results.append({**data, "ts": datetime.now(timezone.utc).isoformat()})
        except Exception as e:
            # One failed URL should not abort the whole batch
            print("error:", e)

save_csv(results)
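In the code above, the only throttling is the random sleep inside fetch, which applies per worker: with 8 threads the site can still see several requests per second. A shared limiter that caps the aggregate rate could look roughly like the sketch below. It uses only the standard library; `RateLimiter` and `run_limited` are hypothetical names, and a fixed minimum interval is just one simple policy among several.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Hand out start slots at least `min_interval` seconds apart, across threads."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self) -> None:
        with self._lock:  # reserve a slot; the lock is held only briefly
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.min_interval
        if slot > now:
            time.sleep(slot - now)  # sleep outside the lock


def run_limited(func, items, limiter: RateLimiter, max_workers: int = 8) -> list:
    """Run func(item) in a thread pool; every call first waits for its slot."""

    def task(item):
        limiter.wait()
        return func(item)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, items))
```

With the names from this section, a call such as `run_limited(parse_title, URLS, RateLimiter(1.0))` would start at most about one request per second regardless of the number of workers. Unlike the as_completed loop above, `pool.map` re-raises the first exception it encounters, so wrap `func` if individual failures should be tolerated.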

Recommendations: