Python Example: Weather Data Collection with a Scrapy Crawler
Table of Contents
Python Example
Problem
weather_spider
spiders
weather_spider.py
items.py
pipelines.py
settings.py
Code Explanation
items.py:
weather_spider.py:
parse method:
parse_city method:
pipelines.py:
WeatherPipeline:
MongoPipeline:
settings.py:
How to Run
Create the Scrapy project:
Replace the code:
Install the dependencies:
Run the spider:
Data output:
Notes
Python Example
Problem
Weather data collection with a Scrapy crawler (Python)
weather_spider
spiders
weather_spider.py
import scrapy
from weather_spider.items import WeatherItem
import re


class WeatherSpider(scrapy.Spider):
    name = "weather"
    allowed_domains = ["weather.com.cn"]
    start_urls = ["http://www.weather.com.cn/textFC/hb.shtml"]  # start from the North China region list

    def parse(self, response):
        """Parse the region page and collect the province/city links."""
        province_links = response.css("div.conMidtab2 a::attr(href)").getall()
        for link in province_links:
            if link.startswith("http"):
                yield scrapy.Request(link, callback=self.parse_city)
            else:
                yield scrapy.Request("http://www.weather.com.cn" + link, callback=self.parse_city)

    def parse_city(self, response):
        """Parse a city page and extract its weather data."""
        crumbs = response.css("div.crumbs a::text").getall()
        city_name = crumbs[-1] if crumbs else ""
        code_match = re.search(r'/(\d+)\.html', response.url)
        city_code = code_match.group(1) if code_match else ""

        # Current-day weather
        today_weather = response.css("div.today ul li")
        if len(today_weather) >= 4:
            item = WeatherItem()
            item["city_name"] = city_name
            item["city_code"] = city_code
            item["date"] = today_weather[0].css("::text").get()
            item["week"] = today_weather[1].css("::text").get()
            item["weather"] = today_weather[2].css("::text").get()
            item["temp_high"] = today_weather[3].css("span::text").get()
            item["temp_low"] = today_weather[3].css("i::text").get()
            item["wind"] = today_weather[4].css("::text").get() if len(today_weather) > 4 else ""
            yield item

        # Forecast for the coming days
        forecast_items = response.css("div.forecast ul li")
        for forecast in forecast_items:
            date_info = forecast.css("h1::text").get()
            if date_info:
                date_parts = date_info.split()
                if len(date_parts) >= 2:
                    weather_item = WeatherItem()
                    weather_item["city_name"] = city_name
                    weather_item["city_code"] = city_code
                    weather_item["date"] = date_parts[0]
                    weather_item["week"] = date_parts[1]
                    weather_item["weather"] = forecast.css("p.wea::text").get()
                    weather_item["temp_high"] = forecast.css("p.tem span::text").get()
                    weather_item["temp_low"] = forecast.css("p.tem i::text").get()
                    weather_item["wind"] = forecast.css("p.win i::text").get()
                    yield weather_item
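A note on coverage: the start_urls above only points at the North China list page. One way to crawl the other regional list pages as well is to generate start_urls from a list of region codes. This is only a sketch; apart from "hb", the region codes below are assumptions about weather.com.cn's URL scheme and should be verified in a browser before relying on them.

# Hypothetical variant: start from every regional list page, not just North China ("hb").
# Region codes other than "hb" are unverified assumptions about the site's URL layout.
REGION_CODES = ["hb", "db", "hd", "hz", "hn", "xb", "xn", "gat"]

class WeatherSpider(scrapy.Spider):
    name = "weather"
    allowed_domains = ["weather.com.cn"]
    start_urls = [f"http://www.weather.com.cn/textFC/{code}.shtml" for code in REGION_CODES]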
items.py
import scrapy


class WeatherItem(scrapy.Item):
    city_name = scrapy.Field()   # city name
    city_code = scrapy.Field()   # city code
    date = scrapy.Field()        # date
    week = scrapy.Field()        # day of week
    weather = scrapy.Field()     # weather conditions
    temp_high = scrapy.Field()   # high temperature
    temp_low = scrapy.Field()    # low temperature
    wind = scrapy.Field()        # wind force
pipelines.py
import json
import pymongo


class WeatherPipeline:
    def __init__(self):
        self.file = open('weather_data.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'weather')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['weather_data'].insert_one(dict(item))
        return item
settings.py
BOT_NAME = 'weather_spider'

SPIDER_MODULES = ['weather_spider.spiders']
NEWSPIDER_MODULE = 'weather_spider.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 8

# Configure a delay for requests for the same website (default: 0)
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 4
CONCURRENT_REQUESTS_PER_IP = 0

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 500,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
    'scrapy.extensions.logstats.LogStats': 500,
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'weather_spider.pipelines.WeatherPipeline': 300,
    # To store data in MongoDB, uncomment the line below and configure MONGO_URI
    # 'weather_spider.pipelines.MongoPipeline': 400,
}

# MongoDB configuration
# MONGO_URI = 'mongodb://localhost:27017'
# MONGO_DATABASE = 'weather'

# Logging configuration
LOG_LEVEL = 'INFO'
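If a fixed 1-second DOWNLOAD_DELAY turns out to be too slow or still too aggressive, Scrapy's built-in AutoThrottle extension can adjust the delay based on observed response times. A minimal sketch of the extra settings (the values are illustrative, not tuned for this site):

# Optional: let Scrapy adapt the crawl speed to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1           # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10            # maximum delay when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average number of parallel requests per server
AUTOTHROTTLE_DEBUG = False             # set to True to log every throttling decision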
Code Explanation
- items.py:
  - Defines the structure of the scraped data, including the city name, date, weather conditions, and other fields.
- weather_spider.py:
  - parse method: parses a region page, extracts the province links, and sends a request for each of them.
  - parse_city method: parses a city weather page and extracts the current weather and the forecast for the coming days.
- pipelines.py:
  - WeatherPipeline: saves the scraped data to a JSON file.
  - MongoPipeline: stores the data in MongoDB (the relevant lines must be uncommented and configured); a small verification sketch follows this list.
- settings.py:
  - Configures the crawler's parameters, such as the request delay, concurrency, request headers, and item pipelines.
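To confirm that MongoPipeline is actually writing data, the collection can be queried directly with pymongo after a crawl. The sketch below assumes the default MONGO_URI and MONGO_DATABASE values from settings.py; the script name check_weather_data.py is just illustrative.

# check_weather_data.py -- small standalone script, not part of the Scrapy project
import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')
db = client['weather']

# Count the stored records and print a few of them
print(db['weather_data'].count_documents({}))
for doc in db['weather_data'].find().limit(5):
    print(doc.get('city_name'), doc.get('date'), doc.get('weather'))

client.close()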
How to Run
- Create the Scrapy project:
scrapy startproject weather_spider
cd weather_spider
scrapy genspider weather weather.com.cn
- Replace the code:
  - Copy the code files above into the corresponding locations in the project.
- Install the dependencies:
pip install scrapy pymongo  # pymongo is only needed if you store data in MongoDB
- Run the spider:
scrapy crawl weather
- Data output:
  - The scraped data is saved to weather_data.json, or to MongoDB depending on the configuration (a feed-export alternative is sketched after this list).
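If a plain output file is all that is needed, Scrapy's built-in feed exports can be used instead of (or alongside) the custom WeatherPipeline, either via scrapy crawl weather -o weather_data.csv on the command line or through the FEEDS setting (available since Scrapy 2.1). A minimal sketch for settings.py, with example filenames:

# Optional feed exports: Scrapy writes the items itself, no custom pipeline required
FEEDS = {
    'weather_data.csv': {'format': 'csv', 'encoding': 'utf8'},
    'weather_data.jl': {'format': 'jsonlines', 'encoding': 'utf8'},
}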
Notes
- Do not crawl too frequently, to avoid having your IP blocked by the site (a 1-second delay is already configured).
- To store data in MongoDB, uncomment the relevant lines in settings.py and configure the connection details correctly.
- The site's structure may change; if the spider stops working, adjust the selectors to match the latest page layout (a scrapy shell sketch for testing selectors follows these notes).
- This crawler is for learning purposes only; the scraped data is for personal research and must not be used for commercial purposes.
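When the selectors stop matching because the page layout changed, scrapy shell is a convenient way to try new ones interactively before editing the spider. A short example session using the selectors from parse() and parse_city():

# Open an interactive shell against the region start page
scrapy shell "http://www.weather.com.cn/textFC/hb.shtml"

# Inside the shell, test the selector used by parse():
>>> response.css("div.conMidtab2 a::attr(href)").getall()[:5]

# To test the parse_city() selectors, fetch one of the linked city pages first:
>>> fetch(response.urljoin(response.css("div.conMidtab2 a::attr(href)").get()))
>>> response.css("div.crumbs a::text").getall()
>>> response.css("div.today ul li")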