Web Scraping in Practice: Fetching Taobao Product Details

ds 2025/7/14 23:08:09

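Taobao is one of China's largest e-commerce platforms and enforces strict anti-scraping measures, so fetching product details takes a combination of HTTP requests, data parsing, and anti-scraping countermeasures. This walkthrough covers environment setup, the implementation itself, and how to handle the anti-scraping defenses.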

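I. Preparation and Environment Setup

1. Required Tools and Libraries

Python environment (3.8+ recommended)
Main libraries:
- requests: sends HTTP requests to fetch page content
- BeautifulSoup/lxml: parses HTML
- json: handles JSON data
- re: extracts specific fields with regular expressions
- selenium/Playwright: handles dynamically loaded content
- fake-useragent: generates random User-Agent strings
Supporting tools:
- Chrome and a WebDriver matching its version
- Fiddler/Charles: captures and inspects network traffic
- Postman: tests API endpoints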

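2. Taobao Product Link Analysis

Taobao product links usually look like:
https://item.taobao.com/item.htm?id=ITEM_ID
Tmall listings use a different host:
https://detail.tmall.com/item.htm?id=ITEM_ID
The key parameter is id, the product's unique identifier.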
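Since everything downstream keys off the id parameter, it helps to pull it out of whatever link you were given. The following is a minimal sketch using only the standard library; the helper name extract_item_id and the sample URL are illustrative, not part of any Taobao API.

```python
from urllib.parse import urlparse, parse_qs


def extract_item_id(url):
    """Return the numeric `id` query parameter of a product URL, or None."""
    query = parse_qs(urlparse(url).query)
    values = query.get("id", [])
    # Only accept a purely numeric id; anything else is treated as missing.
    if values and values[0].isdigit():
        return values[0]
    return None


print(extract_item_id("https://item.taobao.com/item.htm?id=678901234567&spm=a1z10"))
```

Parsing the query string rather than matching with a regex keeps the helper indifferent to parameter order and to extra tracking parameters.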

II. Basic Scraper (with requests)

1. Basic Request Framework

python

import json
import random
import re
import time

import requests
from fake_useragent import UserAgent

# Random User-Agent generator
ua = UserAgent()


def get_taobao_item_detail(item_id):
    """Fetch the HTML of a product detail page; return None on failure."""
    # Build the request URL
    url = f"https://detail.tmall.com/item.htm?id={item_id}"
    # Request headers (a key anti-scraping measure). Header values must be
    # Latin-1 encodable, so URL-encode any non-ASCII search keyword.
    headers = {
        "User-Agent": ua.random,
        "Referer": "https://search.tmall.com/search?q=keyword",
        "Accept": "text/html,application/xhtml+xml,application/xml",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Cache-Control": "max-age=0",
        "Upgrade-Insecure-Requests": "1",
        # Important: a logged-in Cookie gives access to more data
        "Cookie": "your_cookie_string",
    }
    try:
        # Random 1-3 second delay to avoid sending requests too frequently
        time.sleep(random.uniform(1, 3))
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException as e:
        print(f"Request error: {e}")
        return None
    # Check the response status
    if response.status_code == 200:
        response.encoding = "utf-8"
        return response.text
    print(f"Request failed, status code: {response.status_code}")
    return None
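The Cookie header in the function above is pasted as one raw string copied from the browser's developer tools. If a dict is more convenient, for example to pass through the cookies= parameter of requests.get, the string can be split as in this minimal sketch. The helper name and the sample cookie names foo and bar are placeholders, not real Taobao cookies.

```python
def cookie_string_to_dict(raw):
    """Split a 'k1=v1; k2=v2' Cookie header string into a dict."""
    cookies = {}
    for part in raw.split(";"):
        # Split on the first '=' only, so values may themselves contain '='.
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


print(cookie_string_to_dict("foo=abc; bar=x=y"))  # {'foo': 'abc', 'bar': 'x=y'}
```

Fragments without an = sign are skipped rather than raising, which tolerates a trailing semicolon in the copied string.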

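2. Data Parsing (Extracting Core Fields)

Product data on Taobao detail pages is usually embedded in the HTML as JSON and can be extracted with a regular expression: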

python

import json
import re

from bs4 import BeautifulSoup


def _text(node):
    """Return the stripped text of a BeautifulSoup node, or None if it is missing."""
    return node.get_text(strip=True) if node else None


def parse_item_detail(html):
    """Parse product detail HTML and extract the key fields."""
    if not html:
        return {}
    try:
        # Extract the embedded JSON (the regex may need adjusting per page layout)
        match = re.search(r'g_page_config = (\{.*?\});', html)
        if match:
            config_json = json.loads(match.group(1))
            item = config_json.get('itemInfo', {}).get('item', {})
            shop = config_json.get('shopInfo', {})
            # Core fields
            return {
                "商品ID": item.get('id'),                # product ID
                "商品标题": item.get('title'),           # title
                "商品价格": item.get('price'),           # current price
                "原价": item.get('originalPrice'),       # original price
                "销量": item.get('sales'),               # sales volume
                "库存": item.get('stock'),               # stock
                "商品图片": item.get('image'),           # image
                "详情页URL": item.get('detailUrl'),      # detail page URL
                "店铺名称": shop.get('name'),            # shop name
                "店铺ID": shop.get('id'),                # shop ID
                "评价数": config_json.get('commentInfo', {}).get('commentCount'),  # review count
            }
        # Fallback: parse the HTML directly when no embedded JSON is found.
        # Python has no `?.` operator, so missing nodes are handled by _text().
        soup = BeautifulSoup(html, 'lxml')
        return {
            "商品标题": _text(soup.find('h1', class_='tb-main-title')),
            "商品价格": _text(soup.find('em', class_='tb-rmb-num')),
            "销量": _text(soup.find('div', class_='tb-sell-count')),
        }
    except Exception as e:
        print(f"Parse error: {e}")
        return {}
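One caveat with the non-greedy regex above: it stops at the first `};` it sees, so a `};` inside a string value truncates the capture and json.loads then fails. A possible alternative, sketched here under the assumption that the JSON object starts at the first brace after the marker, is to let json.JSONDecoder.raw_decode find the end of the object. The helper name extract_json_after and the sample HTML are made up for illustration.

```python
import json


def extract_json_after(html, marker):
    """Decode the JSON object that follows `marker`; return None if absent or invalid."""
    start = html.find(marker)
    if start == -1:
        return None
    brace = html.find("{", start + len(marker))
    if brace == -1:
        return None
    try:
        # raw_decode parses one JSON value starting at `brace` and ignores the rest.
        obj, _end = json.JSONDecoder().raw_decode(html, brace)
    except json.JSONDecodeError:
        return None
    return obj


# Illustrative page fragment whose title contains "};", which defeats the regex.
sample = '<script>g_page_config = {"itemInfo": {"item": {"title": "a};b", "price": "9.90"}}};</script>'
config = extract_json_after(sample, "g_page_config = ")
print(config["itemInfo"]["item"]["title"])
```

Because the decoder tracks string and brace nesting itself, the result does not depend on what characters appear inside the values.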

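III. Dealing with Anti-Scraping Mechanisms (the Hard Part)

Taobao's anti-scraping measures include:

Browser fingerprinting
Cookie validity checks
Slider CAPTCHAs
Dynamically loaded data
Request rate limiting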
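Several of these measures do not show up as an HTTP error: the server can answer 200 with a login or verification page instead of the product page. A rough heuristic for spotting that case is sketched below. The marker strings are assumptions chosen for illustration and should be checked against the responses you actually receive.

```python
# Assumed markers of login / verification pages; verify against real traffic.
BLOCK_MARKERS = ("login.taobao.com", "login.tmall.com", "punish", "nocaptcha")


def looks_blocked(final_url, html):
    """Heuristically decide whether a response is a login or verification page."""
    url = final_url.lower()
    if any(marker in url for marker in BLOCK_MARKERS):
        return True
    # Only inspect the start of the body, where redirect scripts usually sit.
    head = html[:5000].lower()
    return any(marker in head for marker in BLOCK_MARKERS)


print(looks_blocked("https://detail.tmall.com/item.htm?id=1", "<html><title>item</title></html>"))
```

With requests, the two arguments would typically be response.url (the final URL after redirects) and response.text; a True result is a signal to slow down, refresh the Cookie, or switch to a browser-based approach.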

1. Advanced Approach: Simulating a Browser with Selenium

python

import time

from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

ua = UserAgent()


def get_detail_with_selenium(item_id):
    """Fetch product detail HTML by driving a Chrome browser with Selenium."""
    chrome_options = Options()
    # Optional: headless mode (no visible browser window)
    # chrome_options.add_argument('--headless')
    chrome_options.add_argument(f'user-agent={ua.random}')
    # Reduce the automation signals used to detect WebDriver
    chrome_options.add_argument('--disable-blink-features=AutomationControlled')
    chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
    chrome_options.add_experimental_option('useAutomationExtension', False)

    # Start the browser
    driver = webdriver.Chrome(options=chrome_options)
    try:
        # Hide navigator.webdriver before any page script runs
        driver.execute_cdp_cmd(
            "Page.addScriptToEvaluateOnNewDocument",
            {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
        )
        url = f"https://detail.tmall.com/item.htm?id={item_id}"
        driver.get(url)
        # Wait for the page to load (dynamic content may need longer)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, 'J_DetailMeta'))
        )
        # Scroll to the bottom to trigger lazy-loaded content such as detail images
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give the newly triggered content time to load
        # Return the rendered page source
        return driver.page_source
    except Exception as e:
        print(f"Selenium request error: {e}")
        return None
    finally:
        driver.quit()

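2. Anti-Scraping Optimization Strategies

Cookie management:
- A logged-in Cookie (obtained by QR-code login) gives access to more data
- Use requests.Session() to keep cookies across requests
Request rate control:
- Random delays (random.uniform(2, 5))
- Cap the number of requests per minute (for example, no more than 20)
IP proxies:
- Use a proxy IP pool (for example, 阿布云 (Abuyun) or 快代理 (Kuaidaili))
- Example (using a proxy with requests):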

python

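# Replace PROXY_IP and PORT with a real proxy. Most HTTP proxies are reached
# over plain http:// even when they tunnel HTTPS traffic, so both entries use
# the http:// scheme; use https:// only if the proxy itself speaks TLS.
proxies = {
    "http": "http://PROXY_IP:PORT",
    "https": "http://PROXY_IP:PORT",
}
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)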

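CAPTCHA handling:
- Complex CAPTCHAs need manual intervention or a CAPTCHA-solving service (for example, 超级鹰 (Chaojiying))
- Selenium can simulate the manual slider drag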
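The two request rate rules above (a random delay per request and a cap per minute) can be combined in one small helper. This is a sketch of one possible design, a sliding-window limiter; the class name RateLimiter and its parameters are illustrative, and the defaults simply mirror the numbers given in the list. The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import random
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` per `period` seconds, plus a random pause per call."""

    def __init__(self, max_calls=20, period=60.0, jitter=(2.0, 5.0),
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.jitter = jitter
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another request is allowed, then record it."""
        now = self.clock()
        # Drop timestamps that have left the sliding window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Window is full: sleep until the oldest call expires, then free its slot.
            self.sleep(self.period - (now - self.calls[0]))
            self.calls.popleft()
        # Random per-request delay on top of the cap.
        self.sleep(random.uniform(*self.jitter))
        self.calls.append(self.clock())
```

In a scraping loop, create one limiter and call limiter.wait() immediately before each request, so the pacing stays in one place instead of being scattered across time.sleep calls.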

IV. Complete End-to-End Example

python

# 1. Define the list of product IDs
item_ids = ["678901234567", "567890123456"]  # replace with real product IDs

# 2. Fetch the details of each product
all_items = []
for item_id in item_ids:
    print(f"Fetching product ID: {item_id}")
    # Try requests first; fall back to Selenium on failure
    html = get_taobao_item_detail(item_id)
    if not html:
        print("requests failed, trying Selenium...")
        html = get_detail_with_selenium(item_id)
    # Parse the data
    item_data = parse_item_detail(html)
    if item_data:
        all_items.append(item_data)
        print(f"Fetched: {item_data.get('商品标题')}")
    else:
        print(f"Parse failed, product ID: {item_id}")
    # Pause between products to avoid frequent requests
    time.sleep(random.uniform(3, 6))

# 3. Save the data (as a JSON file)
if all_items:
    with open(f"taobao_items_{time.strftime('%Y%m%d')}.json", 'w', encoding='utf-8') as f:
        json.dump(all_items, f, ensure_ascii=False, indent=2)
    print(f"Saved {len(all_items)} product records")