Python Example Exercise: Scraping Dating-Site Data with Python

news 2025/7/15 4:21:52

Python Example Exercise

Problem

Scrape dating-site data with Python

python-crawl-dating-site: Python script for scraping dating-site data

import time

import requests
from bs4 import BeautifulSoup


def crawl_dating_site(url):
    """Fetch one listing page and return a list of profile dicts."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    try:
        # timeout keeps the script from hanging on an unresponsive server
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        soup = BeautifulSoup(response.text, 'html.parser')
        profiles = []
        # Assumes each profile sits in a div with class 'dating-profile';
        # adjust the selectors to match the real page.
        profile_divs = soup.find_all('div', class_='dating-profile')
        for profile_div in profile_divs:
            try:
                name = profile_div.find('span', class_='name').text.strip()
                age = profile_div.find('span', class_='age').text.strip()
                gender = profile_div.find('span', class_='gender').text.strip()
                profile = {'name': name, 'age': age, 'gender': gender}
                profiles.append(profile)
            except AttributeError:
                # find() returned None for a missing field: skip this profile
                continue
        return profiles
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return []


if __name__ == "__main__":
    base_url = 'https://example-dating-site.com/page'  # replace with the real dating-site URL
    total_pages = 3  # total number of pages to scrape
    all_profiles = []
    for page in range(1, total_pages + 1):
        url = f'{base_url}{page}'
        profiles = crawl_dating_site(url)
        all_profiles.extend(profiles)
        print(f"Page {page}: fetched {len(profiles)} profiles.")
        time.sleep(2)  # throttle requests to avoid an IP ban
    for profile in all_profiles:
        print(f"Name: {profile['name']}, Age: {profile['age']}, Gender: {profile['gender']}")

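Code explanation

Request header setup:
- Sets the User-Agent request header so the request resembles one sent by a browser, which lowers the chance of being blocked by anti-scraping measures.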
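To make the header step concrete, here is a minimal sketch. The helper name build_headers and the constant DEFAULT_USER_AGENT are hypothetical and not part of the script. The standard library's urllib.request.Request is used only to inspect the header without sending anything over the network; the same dict would be passed to requests.get(url, headers=...) in the script. Without a custom value, requests identifies itself as python-requests/<version>, which is easy for a site to recognize.

```python
import urllib.request

# Same browser-like string as in the script above.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)


def build_headers(user_agent=DEFAULT_USER_AGENT):
    """Hypothetical helper: return the header dict to send with each request."""
    return {"User-Agent": user_agent}


# Build (but do not send) a request so the header can be inspected offline.
req = urllib.request.Request(
    "https://example-dating-site.com/page1",  # placeholder URL from the script
    headers=build_headers(),
)

# urllib stores header names capitalized as "User-agent".
sent_user_agent = req.get_header("User-agent")
print(sent_user_agent)
```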
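The crawl_dating_site function:
- Sends an HTTP request to fetch the content of the given dating-site page.
- Parses the HTML with BeautifulSoup and locates the elements that hold the profile information.
- Extracts the name, age, and gender, stores them in a dictionary, and appends it to the result list; a profile missing any of these fields is skipped.
- Catches request exceptions (requests.RequestException), prints the error, and returns an empty list, so one failed page does not stop the whole run.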
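The extraction logic can be tried offline on a small HTML snippet. The sketch below is not the script's method: it swaps BeautifulSoup for the standard library's html.parser so it runs with no third-party install. SAMPLE_HTML, ProfileParser, and parse_profiles are hypothetical names, the markup is made up to match the placeholder class names, and the parser assumes profile divs are not nested inside each other.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the second profile has no gender field.
SAMPLE_HTML = """
<div class="dating-profile">
  <span class="name"> Alice </span>
  <span class="age">28</span>
  <span class="gender">F</span>
</div>
<div class="dating-profile">
  <span class="name">Bob</span>
  <span class="age">31</span>
</div>
"""

FIELDS = ("name", "age", "gender")


class ProfileParser(HTMLParser):
    """Collect name/age/gender spans found inside dating-profile divs."""

    def __init__(self):
        super().__init__()
        self.profiles = []    # completed profiles
        self._current = None  # profile being filled, None outside a profile div
        self._field = None    # field whose text is being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "dating-profile" in classes:
            self._current = {}
        elif tag == "span" and self._current is not None:
            self._field = next((f for f in FIELDS if f in classes), None)

    def handle_data(self, data):
        if self._current is not None and self._field is not None:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and self._current is not None:
            # Keep only complete profiles, mirroring the script's
            # `except AttributeError: continue` branch.
            if all(f in self._current for f in FIELDS):
                self.profiles.append(
                    {f: self._current[f].strip() for f in FIELDS}
                )
            self._current = None


def parse_profiles(html):
    """Return the complete profiles found in an HTML string."""
    parser = ProfileParser()
    parser.feed(html)
    parser.close()
    return parser.profiles


profiles = parse_profiles(SAMPLE_HTML)
print(profiles)
```

Only the first profile is returned here, because the second one lacks a gender span and is dropped in the same way the script drops incomplete entries.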
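Main program:
- Defines the base URL and the total number of pages to scrape.
- Builds the URL for each page number in a loop and calls crawl_dating_site to fetch that page's profiles.
- Calls time.sleep(2) after each page to throttle requests and reduce the risk of the site banning the IP.
- Prints all the profiles that were collected.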
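The pagination loop can be sketched separately from the network code. In this version the fetch function and the sleep function are passed in as parameters, so the loop can be exercised with a stub instead of real requests. The names page_urls, crawl_pages, and fake_fetch are hypothetical, and skipping the pause after the last page is a small variation on the script, which sleeps after every page.

```python
import time


def page_urls(base_url, total_pages):
    """Yield (page, url) pairs for pages 1..total_pages."""
    for page in range(1, total_pages + 1):
        yield page, f"{base_url}{page}"


def crawl_pages(base_url, total_pages, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(url) for every page and merge the results into one list."""
    all_profiles = []
    for page, url in page_urls(base_url, total_pages):
        all_profiles.extend(fetch(url))
        if page < total_pages:
            sleep(delay)  # pause between requests, not after the last one
    return all_profiles


# Stub standing in for crawl_dating_site, so nothing is sent over the network.
def fake_fetch(url):
    return [{"name": f"user from {url}", "age": "30", "gender": "F"}]


urls = [url for _, url in page_urls("https://example-dating-site.com/page", 3)]
result = crawl_pages(
    "https://example-dating-site.com/page", 3, fake_fetch, delay=0
)
print(urls)
print(len(result))
```

In the real script, crawl_dating_site would be passed as fetch and delay would stay at 2 seconds.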