
Introduction to Python Web Scraping

news 2025/7/3 12:22:40

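A scraper needs a library for HTTP requests, a library for parsing HTML/XML, and, for pages rendered by JavaScript, a library that handles dynamic content.

Examples are requests (HTTP) and lxml (parsing).

The first step is to send a GET request to the URL, passing the headers and any query keywords along with it.

Import the library: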

import requests

The headers make the request look like it comes from an ordinary user visiting the URL in a browser:

headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36'}

r=requests.get('https://b.faloo.com/1473774_1.html',headers=headers)
# r.encoding='utf-8'
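To make the "headers and keywords" part concrete, here is a minimal sketch that builds a request without sending it, so the final URL and headers can be inspected offline. The endpoint https://example.com/search and the parameter name q are made up for illustration; they are not from the site used in this article.

```python
import requests

# Hypothetical search endpoint and keyword, used only for illustration.
url = 'https://example.com/search'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
params = {'q': 'python'}

# Build the request without sending it, to see what would go on the wire.
prepared = requests.Request('GET', url, headers=headers, params=params).prepare()
print(prepared.url)                    # the keyword is appended as a query string
print(prepared.headers['user-agent'])  # header lookup is case-insensitive

# Sending it for real would look like this (needs network access):
# r = requests.get(url, headers=headers, params=params, timeout=10)
# r.raise_for_status()  # raise an exception on 4xx/5xx responses
```

Passing a timeout and calling raise_for_status() is optional, but it keeps a scraper from hanging on a dead server or silently parsing an error page.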

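If the text comes back garbled, set r.encoding (for example r.encoding='utf-8') before reading r.text; it controls how the response body is decoded.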
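Why the encoding matters can be shown without any network access. The sketch below uses a made-up GBK byte string standing in for a response body; in requests, assigning r.encoding has the same effect on r.text as choosing the codec here.

```python
# Bytes as a server might send them for a GBK-encoded page (made-up sample).
raw = '第一章 开始'.encode('gbk')

# Decoding with the wrong codec yields replacement characters...
wrong = raw.decode('utf-8', errors='replace')
# ...while the right codec recovers the original text.
right = raw.decode('gbk')

print(wrong)
print(right)

# With requests the equivalent would be:
# r.encoding = 'gbk'                 # or r.encoding = r.apparent_encoding to let requests guess
# text = r.text
```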

After from lxml import etree, parse the text of the response r into an HTML element tree:

a=etree.HTML(r.text)
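As a small illustration of what etree.HTML returns, the sketch below parses a deliberately incomplete fragment (the snippet is invented for this example). The HTML parser is forgiving and wraps the fragment in a full document.

```python
from lxml import etree

# A sloppy snippet: no <html>/<body>, and the <p> and <div> are never closed.
snippet = '<div id="chaptercontent"><p>Hello, reader'

root = etree.HTML(snippet)
print(root.tag)                                  # the parser adds an <html> root
print(etree.tostring(root, encoding='unicode'))  # repaired markup as a string
print(root.find('.//p').text)                    # text of the first <p> it finds
```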

Use an XPath expression to locate the block you want and extract its content:

info=a.xpath('//div[contains(@class, "noveContent")]/p/text()')

The trailing text() step returns the text content of the matched tags as a list of strings.
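One detail worth seeing on a small sample: text() only returns text that sits directly inside the matched tag. The markup below is invented to mirror the noveContent selector above; it is not copied from the real site.

```python
from lxml import etree

sample = '''
<div class="noveContent">
  <p>First paragraph.</p>
  <p>Second <b>bold</b> paragraph.</p>
</div>
<div class="footer"><p>Not part of the chapter.</p></div>
'''
tree = etree.HTML(sample)

# text() returns only the text nodes that are direct children of each <p>.
direct = tree.xpath('//div[contains(@class, "noveContent")]/p/text()')

# itertext() walks nested tags too, so the text inside <b> is included.
full = [''.join(p.itertext())
        for p in tree.xpath('//div[contains(@class, "noveContent")]/p')]

print(direct)
print(full)
```

If a chapter comes out with words missing, nested tags such as <b> or <span> are a likely cause, and the itertext() variant is one way around it.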

Complete example

import os

import requests
from lxml import etree

headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36'}
base_dir='C:\\Users\\Administrator\\Desktop\\小说'
footer='请收藏本站：https://www.bibie.cc。笔趣阁手机版：https://m.bibie.cc'

num=1          # chapter number within the current book
books_num=1    # book number on the site
cur_title=""
url='https://www.bibie.cc/html/1/1.html'

while True:
    r=requests.get(url,headers=headers)
    # r.encoding='utf-8'
    a=etree.HTML(r.text)
    info=a.xpath('//div[contains(@id,"chaptercontent")]/text()')
    title=a.xpath('//title/text()')
    next_url=a.xpath('//a[contains(@id,"pb_next")]/@href')
    # Stop when the page has no title or no "next" link.
    if not title or not next_url:
        break
    # The book name is the part of the page title before the first '-'.
    real_title=title[0].split('-')[0]
    if real_title!=cur_title:
        cur_title=real_title
        # makedirs with exist_ok avoids an error when the folder already exists.
        os.makedirs(os.path.join(base_dir,cur_title),exist_ok=True)
    path=os.path.join(base_dir,cur_title,'第'+str(num)+'章.txt')
    with open(path,'a',encoding='utf-8') as f:
        print(cur_title+' 第'+str(num)+'章\n')
        for i in info:
            # Compare the whole line (not just its first character) with the site footer.
            if i.strip()==footer:
                break
            print(i)
            f.write(i+'\n')
        f.write('\n\n')
    num=num+1
    print(next_url)
    # When the "next" link points back to the book index, move on to the next book.
    if next_url[0]=='/html/'+str(books_num)+'/':
        books_num=books_num+1
        num=1
    url='https://www.bibie.cc/html/'+str(books_num)+'/'+str(num)+'.html'
    # print(r.text)
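The example rebuilds each URL from counters. An alternative, shown here only as a sketch and not as the approach used above, is to follow the href taken from the "next" link directly. The helper name next_page_url is hypothetical; urljoin comes from the standard library.

```python
from urllib.parse import urljoin

def next_page_url(current_url, hrefs):
    """Return the absolute URL of the next page, or None if no link was found."""
    if not hrefs:
        return None
    # urljoin resolves both site-relative ('/html/1/2.html') and
    # page-relative ('2.html') links against the current page.
    return urljoin(current_url, hrefs[0])

current = 'https://www.bibie.cc/html/1/1.html'
print(next_page_url(current, ['/html/1/2.html']))
print(next_page_url(current, ['2.html']))
print(next_page_url(current, []))
```

Here hrefs stands for the list returned by the pb_next XPath query, so an empty list (no link on the page) maps to None instead of raising an IndexError.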

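XPath syntax in lxml, as in xpath('//tag/child[contains(@attribute, "value")]'): // matches a tag anywhere in the document, / selects a child tag directly under the preceding tag, and [contains(@attribute, "value")] keeps only the tags whose attribute contains the given value.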
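These rules can be checked on a tiny document. The markup below is invented for illustration (the id pb_next is borrowed from the example above), and it contrasts // with an absolute path and contains() with an exact attribute match.

```python
from lxml import etree

page = etree.HTML('''
<div id="outer">
  <div class="chapter main">
    <a id="pb_next" href="/html/1/2.html">Next</a>
  </div>
</div>
''')

# '//' searches the whole document, wherever the <a> is nested.
anywhere = page.xpath('//a/@href')

# '/' walks one level at a time, so every step must be spelled out.
absolute = page.xpath('/html/body/div/div/a/@href')

# An exact match fails because the class attribute is "chapter main"...
exact = page.xpath('//div[@class="chapter"]')

# ...while contains() matches any attribute value that includes the substring.
partial = page.xpath('//div[contains(@class, "chapter")]')

print(anywhere, absolute, len(exact), len(partial))
```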
