Introduction to Python Web Scraping
A scraper needs a library for HTTP requests, a library for parsing HTML/XML, a library for handling dynamic content, and so on,
such as requests and lxml
First, send a GET request to the URL, passing along the header information and any keywords
Import the library
import requests
The headers disguise the request as an ordinary user's browser visiting the URL
headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36'}
r=requests.get('https://b.faloo.com/1473774_1.html',headers=headers)
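The request above sends headers only. As a minimal sketch of how a keyword could be attached as well, the snippet below builds a request without sending it, so the final URL and headers can be inspected offline. The address https://example.com/search and the parameter name q are placeholders chosen for illustration, not part of the site used in these notes, and the user-agent string is shortened.

```python
import requests

# Shortened placeholder user-agent string.
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Hypothetical search endpoint and parameter name, for illustration only.
req = requests.Request('GET', 'https://example.com/search',
                       headers=headers, params={'q': 'python'})
prepared = req.prepare()  # builds the final URL and headers; nothing is sent

print(prepared.url)                    # the keyword is appended as a query string
print(prepared.headers['User-Agent'])  # header lookup is case-insensitive
```

requests.get accepts the same headers and params arguments. Passing timeout=10 and calling r.raise_for_status() afterwards is a common way to avoid hanging forever or silently parsing an error page.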
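If the text comes back garbled, the encoding used to decode the response can be set before reading r.text (uncomment to use)
# r.encoding='utf-8'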
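In requests, r.text is the raw body r.content decoded with r.encoding, so a wrong encoding produces garbled characters. The sketch below imitates this with plain bytes instead of a real response. The sample string and the choice of GBK are assumptions made for the demonstration.

```python
# "小说" encoded as GBK, standing in for the raw bytes of a response (r.content).
raw = '小说'.encode('gbk')

wrong = raw.decode('utf-8', errors='replace')  # mismatched codec: garbled text
right = raw.decode('gbk')                      # matching codec: original text

print(repr(wrong))
print(right)
```

With a real response, r.encoding = r.apparent_encoding lets requests guess the encoding from the body. That guess is a heuristic, so an explicit value is safer when the site's encoding is known.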
Import etree with from lxml import etree, then parse the fetched response r into an HTML tree
a=etree.HTML(r.text)
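Use XPath to locate a specific block and extract its content
info=a.xpath('//div[contains(@class, "noveContent")]/p/text()')
text() at the end of the path returns the text of the matched elements as a list of strings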
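One detail worth seeing in isolation: /text() returns only the text nodes that are direct children of the matched element, while //text() also descends into nested tags. The sketch below uses a made-up HTML fragment in place of a downloaded page.

```python
from lxml import etree

# Made-up HTML fragment standing in for a downloaded page.
html = '<div class="noveContent"><p>First line</p><p>Second <b>bold</b> tail</p></div>'
a = etree.HTML(html)

direct = a.xpath('//div[contains(@class, "noveContent")]/p/text()')
nested = a.xpath('//div[contains(@class, "noveContent")]/p//text()')

print(direct)  # text nodes that are direct children of each <p>
print(nested)  # also includes text inside nested tags such as <b>
```

If paragraphs on the target page contain inline markup, the second form avoids losing the text wrapped in those tags.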
Complete example
import requests
from lxml import etree
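import os

headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36'}
# Output folder ("小说" means "novels")
base_dir='C:\\Users\\Administrator\\Desktop\\小说\\'
# Footer line the site appends to each chapter
footer='请收藏本站:https://www.bibie.cc。笔趣阁手机版:https://m.bibie.cc'
num=1          # chapter number within the current book
url='https://www.bibie.cc/html/1/1.html'
books_num=1    # book number in the site's URL scheme
cur_title=""
while True:
    r=requests.get(url,headers=headers)
    # r.encoding='utf-8'
    a=etree.HTML(r.text)
    info=a.xpath('//div[contains(@id,"chaptercontent")]/text()')
    title=a.xpath('//title/text()')
    next_url=a.xpath('//a[contains(@id,"pb_next")]/@href')
    # Stop when the page lacks the expected elements, to avoid an IndexError below
    if not title or not next_url:
        break
    # The part of the page title before the first '-' is used as the book title
    parts=title[0].split('-')
    real_title=parts[0]
    if real_title!=cur_title:
        cur_title=real_title
        os.makedirs(base_dir+cur_title,exist_ok=True)  # one folder per book
    # Files are named "第<num>章.txt" ("Chapter <num>")
    with open(base_dir+cur_title+'\\'+'第'+str(num)+'章.txt','a',encoding='utf-8') as f:
        print(cur_title+' 第'+str(num)+'章\n')
        for i in info:
            if footer in i:  # stop copying at the site's footer line
                break
            print(i)
            f.write(i+'\n')
        f.write('\n\n')
    num=num+1
    print(next_url)
    # A "next" link that points at the book index means this was the last chapter
    if next_url[0]=='/html/'+str(books_num)+'/':
        books_num=books_num+1
        num=1
    url='https://www.bibie.cc/html/'+str(books_num)+'/'+str(num)+'.html'
    # print(r.text)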
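The complete example rebuilds the next URL from its own counters. An alternative sketch is to follow the href of the pb_next link directly. This assumes the link holds a site-relative path such as /html/1/2.html, which the comparison against '/html/<book>/' in the example suggests but does not prove. The helper name resolve_next is hypothetical.

```python
from urllib.parse import urljoin

def resolve_next(current_url, hrefs):
    """Return the absolute URL of the next page, or None when no link was found.

    hrefs is the list returned by an XPath query such as
    a.xpath('//a[contains(@id,"pb_next")]/@href').
    """
    if not hrefs:
        return None
    # urljoin resolves a relative href against the page it was found on.
    return urljoin(current_url, hrefs[0])

next_page = resolve_next('https://www.bibie.cc/html/1/1.html', ['/html/1/2.html'])
no_page = resolve_next('https://www.bibie.cc/html/1/1.html', [])
print(next_page)
print(no_page)
```

Returning None for an empty list gives the loop a natural stopping condition instead of an IndexError.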
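XPath syntax in lxml: xpath('//tag/child[contains(@attribute, "value")]'), where // matches the tag anywhere in the document, / selects a direct child of the preceding tag, and [contains(@attribute, "value")] keeps only the elements whose attribute contains the given value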
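The pieces of that syntax can be tried on a small inline page. The HTML below is invented for the demonstration and only mimics the ids used in the complete example (chaptercontent, pb_next). The real site's markup may differ.

```python
from lxml import etree

# Invented page that mimics the ids used in the complete example.
html = ('<html><head><title>Book-Chapter 1</title></head><body>'
        '<div id="chaptercontent">Line one<br/>Line two</div>'
        '<a id="pb_next" href="/html/1/2.html">Next</a>'
        '</body></html>')
a = etree.HTML(html)

title = a.xpath('//title/text()')                                 # // : anywhere in the document
lines = a.xpath('//div[contains(@id, "chaptercontent")]/text()')  # filter by attribute value
href = a.xpath('//a[contains(@id, "pb_next")]/@href')             # @href : read an attribute
absolute = a.xpath('/html/body/a/text()')                         # / : step by step from the root

print(title, lines, href, absolute)
```

Every query returns a list, even when only one node matches, which is why the complete example indexes results with [0].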