
[Python Crawlers and Data Analysis] Crawler Proxy IPs and Access Control

Contents

1. Proxy IPs

2. Regular expressions with re

3. Visiting a site in a loop through proxy IPs

4. Access control with Selenium


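Note: the most practical material is at the end, but you won't be able to follow it unless you read the rest carefully! (grin)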

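1. Proxy IPs

When a crawler requests resources from a server, it usually does not need a proxy IP. If it has to hit the same server frequently, however, a proxy IP helps it get around the server's anti-crawling measures: the crawler's real identity is hidden, so the server cannot block our actual IP address.

Disguising a crawler is not limited to the IP address. It can also cover the fields of the request header:

  • User-Agent: information about the browser making the request
  • Referer: the page the request was reached from
  • Cookie: session and identity data sent along with the request

These header fields can be added or forged as needed. Any field left unset falls back to the HTTP client's default; with requests, for instance, the default User-Agent identifies the library as python-requests rather than a browser.

Some resources can be fetched without filling in or forging any headers. Resources that require special permissions (VIP access, for example) can generally only be fetched with a Cookie that carries sufficient privileges.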

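Proxy IP addresses are usually collected from sites that list them. Here is one:

http://www.kxdaili.com/dailiip.html

With some simple crawling (HTML parsing), 100 free proxy IPs can be pulled from this site. Storing each proxy as a dictionary of the form {protocol: ip address} and appending it to a list gives us a proxy IP pool.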

import requests
from lxml import etree

proxies_lst = []
for i in range(1, 11):
    # The listing pages follow the pattern
    # http://www.kxdaili.com/dailiip/1/2.html, http://www.kxdaili.com/dailiip/1/3.html, ...
    ip_url = f'http://www.kxdaili.com/dailiip/1/{i}.html'
    response = requests.get(ip_url)
    html = etree.HTML(response.text)
    # One <tr> per proxy in the table on the page
    ip_lst = html.xpath('//div[@class="header-container"]/div[2]/div[2]/div/div[2]/table/tbody/tr')
    for ip_info in ip_lst:
        ip = ip_info.xpath('./td[1]/text()')[0]
        port = ip_info.xpath('./td[2]/text()')[0]
        # Protocol column; requests looks proxies up by lowercase URL scheme
        ht = ip_info.xpath('./td[4]/text()')[0].strip().lower()
        proxies_info = {ht: ip + ':' + port}
        proxies_lst.append(proxies_info)

for i in proxies_lst:
    print(i)
print(len(proxies_lst))
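To make the pool format concrete, here is a minimal sketch that builds the list of {protocol: address} dictionaries from rows that have already been parsed, then draws one entry at random. The helper name build_proxy_pool, the sample rows, and the lowercasing and de-duplication steps are assumptions made for illustration; the addresses come from the reserved documentation range 192.0.2.0/24 and are not real proxies.

```python
import random


def build_proxy_pool(rows):
    """Turn (ip, port, protocol) rows into a list of {protocol: 'ip:port'} dicts.

    Hypothetical helper: it lowercases the protocol and drops exact duplicates.
    """
    pool = []
    seen = set()
    for ip, port, protocol in rows:
        key = protocol.strip().lower()
        address = f"{ip.strip()}:{port.strip()}"
        if (key, address) in seen:
            continue  # same proxy listed twice
        seen.add((key, address))
        pool.append({key: address})
    return pool


# Sample rows standing in for the table cells scraped from the proxy site
rows = [
    ("192.0.2.10", "8080", "HTTP"),
    ("192.0.2.11", "3128", "https"),
    ("192.0.2.10", "8080", "http"),  # duplicate of the first row
]

pool = build_proxy_pool(rows)
proxies = random.choice(pool)  # this dict is what would be passed as proxies=...
print(pool)
print(proxies)
```

A larger pool means each address is drawn less often, which is the point made in the text about pool size.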

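Cookies are usually hard to forge. If a resource is gated by Cookie, use a valid one if you have it; without one the resource is generally out of reach and you need another approach (I'm still a crawling novice and don't have one yet).

User-Agent and Referer are forged by picking values at random with the random library. The proxy IP is likewise drawn at random from the proxy pool, so the larger the pool the better (each IP is then reused less often):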

import random

user_agent_list = [
    'Mozilla/5.0(compatible;MSIE9.0;WindowsNT6.1;Trident/5.0)',
    'Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)',
    'Mozilla/4.0(compatible;MSIE7.0;WindowsNT6.0)',
    'Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11',
    'Mozilla/5.0(WindowsNT6.1;rv:2.0.1)Gecko/20100101Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]

referer_list = [
    'http://blog.csdn.net/dala_da/article/details/79401163',
    'http://blog.csdn.net/',
    'https://www.sogou.com/tx?query=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F&hdq=sogou-site-706608cfdbcc1886-0001&ekv=2&ie=utf8&cid=qb7.zhuye&',
    'https://www.baidu.com/s?tn=98074231_1_hao_pg&word=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F'
]

user_agent = random.choice(user_agent_list)
referer = random.choice(referer_list)
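The randomly chosen values still have to be assembled into a headers dictionary before they can be passed to a request. A minimal sketch of that step, assuming a hypothetical helper random_headers and two shortened sample lists (neither is part of the original script):

```python
import random


def random_headers(user_agents, referers, cookie=None):
    """Build a headers dict with a randomly chosen User-Agent and Referer.

    Hypothetical helper; the Cookie field is added only when one is supplied.
    """
    headers = {
        "User-Agent": random.choice(user_agents),
        "Referer": random.choice(referers),
    }
    if cookie:
        headers["Cookie"] = cookie
    return headers


# Shortened sample lists for illustration only
sample_user_agents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
]
sample_referers = ["http://blog.csdn.net/", "https://www.baidu.com/"]

headers = random_headers(sample_user_agents, sample_referers)
print(headers)
```

Calling the helper once per request gives each request its own combination, instead of fixing one choice for the whole run.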

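2. Regular expressions with re

The re module is an important tool for processing string data in Python. Regular expression syntax is fairly complex, so this article skips the details and only covers a few re features that come up often in crawlers.

Crawlers frequently need to pull a piece of information out of a string, especially out of a URL.

A URL usually consists of a protocol (https://), a domain name (www.baidu.com), a resource path, and parameters. The path and parameters often contain the field we want, and re can split the string so that we can pick it out.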

Example 1: https://blog.csdn.net/phoenixFlyzzz

Get the user ID from the URL in Example 1:

import re

url = "https://blog.csdn.net/phoenixFlyzzz"
user_id = re.split("/", url)[3]
print(user_id)
# phoenixFlyzzz

As this shows, re.split() cuts a string at every match of the pattern and returns the pieces as a list.
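Printing the whole list makes it clear why the user ID sits at index 3: the two slashes after the protocol produce an empty string between them. A small sketch using the same URL as Example 1:

```python
import re

url = "https://blog.csdn.net/phoenixFlyzzz"
parts = re.split("/", url)

print(parts)
# ['https:', '', 'blog.csdn.net', 'phoenixFlyzzz']
print(parts[3])
# phoenixFlyzzz
```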

Example 2: https://blog.csdn.net/phoenixFlyzzz?type=blog

Get the user ID from the URL in Example 2:

import re

url = "https://blog.csdn.net/phoenixFlyzzz?type=blog"
user_id = re.split(r"/|\?", url)[3]
print(user_id)
# phoenixFlyzzz

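re.split() can therefore split on more than one delimiter. Here the delimiters are / and ?: the | separates the alternatives, and the \ is needed because ? has a special meaning in a regular expression, so \? matches a literal question mark.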
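Splitting on / and ? relies on the ID always being the fourth piece of the URL. As an alternative sketch (not the method used in this article), the standard library's urllib.parse can separate the path from the query string first. The helper name extract_user_id is hypothetical.

```python
from urllib.parse import urlsplit


def extract_user_id(url):
    """Return the first path segment of a URL, or None if the path is empty."""
    path = urlsplit(url).path  # the query string (?type=blog) is not part of the path
    segments = [segment for segment in path.split("/") if segment]
    return segments[0] if segments else None


print(extract_user_id("https://blog.csdn.net/phoenixFlyzzz"))
# phoenixFlyzzz
print(extract_user_id("https://blog.csdn.net/phoenixFlyzzz?type=blog"))
# phoenixFlyzzz
```

Returning None for a URL without a path lets the caller handle a mistyped link explicitly instead of hitting an IndexError.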

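3. Visiting a site in a loop through proxy IPs

When a crawler visits a site repeatedly, it must not do so too often, because that increases the load on the server. Keep the request rate under control by pausing the code with the time module.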
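One simple way to pace requests is to sleep for a base interval plus a small random amount between them. The sketch below assumes a hypothetical helper polite_sleep; the delays are deliberately tiny so the example finishes quickly, whereas a real crawler would wait a second or more.

```python
import random
import time


def polite_sleep(base_seconds, jitter_seconds=0.0):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds.

    Returns the number of seconds actually requested from time.sleep.
    """
    delay = base_seconds + random.uniform(0.0, jitter_seconds)
    time.sleep(delay)
    return delay


for page in range(1, 4):
    # ... the request for `page` would go here ...
    waited = polite_sleep(0.01, 0.01)
    print(f"page {page}: waited {waited:.3f}s")
```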

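(Disclaimer: all code in this article is for learning purposes only and must not be used commercially.)

This is a crawler that visits a blog's articles in an automatic loop: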

import requests
from lxml import etree
import random
import time
import re
import json

user_url = input("Enter the URL of the user's home page: ")

# Get the URLs of all the user's articles from the home page link.
# Use re to extract user_id from user_url.
user_id = re.split(r"/|\?", user_url)[3]
json_url = f'https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username={user_id}'

# Request the JSON payload
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
    'referer': user_url,
    # Replace with the Cookie header of your own logged-in CSDN session
    'cookie': '<your CSDN cookie>'
}
json_response = requests.get(json_url, headers=headers)
time.sleep(2)

article_info_lst = []
json_data = json.loads(json_response.text)
article_num = json_data['data']['total']
print(f'article_num={article_num}')

# 20 articles per page, rounded up
n = (article_num + 19) // 20
try:
    for i in range(n):
        json_url = f'https://blog.csdn.net/community/home-api/v1/get-business-list?page={i+1}&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username={user_id}'
        json_response = requests.get(json_url, headers=headers)
        json_data = json.loads(json_response.text)
        article_lst = json_data['data']['list']
        for article in article_lst:
            article_info_lst.append((article['url'], article['title']))
except Exception as e:
    print(e)

# Collect proxy IPs
proxies_lst = []
for i in range(1, 11):
    # The listing pages follow the pattern
    # http://www.kxdaili.com/dailiip/1/2.html, http://www.kxdaili.com/dailiip/1/3.html, ...
    ip_url = f'http://www.kxdaili.com/dailiip/1/{i}.html'
    response = requests.get(ip_url)
    html = etree.HTML(response.text)
    ip_lst = html.xpath('//div[@class="header-container"]/div[2]/div[2]/div/div[2]/table/tbody/tr')
    for ip_info in ip_lst:
        ip = ip_info.xpath('./td[1]/text()')[0]
        port = ip_info.xpath('./td[2]/text()')[0]
        # Protocol column; requests looks proxies up by lowercase URL scheme
        ht = ip_info.xpath('./td[4]/text()')[0].strip().lower()
        proxies_info = {ht: ip + ':' + port}
        proxies_lst.append(proxies_info)

for i in proxies_lst:
    print(i)
print(len(proxies_lst))

# Forge the browser and the browsing trail
user_agent_list = [
    'Mozilla/5.0(compatible;MSIE9.0;WindowsNT6.1;Trident/5.0)',
    'Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)',
    'Mozilla/4.0(compatible;MSIE7.0;WindowsNT6.0)',
    'Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11',
    'Mozilla/5.0(WindowsNT6.1;rv:2.0.1)Gecko/20100101Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]

referer_list = [
    'http://blog.csdn.net/dala_da/article/details/79401163',
    'http://blog.csdn.net/',
    'https://www.sogou.com/tx?query=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F&hdq=sogou-site-706608cfdbcc1886-0001&ekv=2&ie=utf8&cid=qb7.zhuye&',
    'https://www.baidu.com/s?tn=98074231_1_hao_pg&word=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F'
]

test_num = 1
while True:
    print(f'Round {test_num}')
    test_num += 1
    for article in article_info_lst:
        url = article[0]
        headers = {
            'Referer': random.choice(referer_list),
            'User-Agent': random.choice(user_agent_list)
        }
        pos = random.randint(0, len(proxies_lst) - 1)
        proxies = proxies_lst[pos]
        try:
            response = requests.get(url, headers=headers, proxies=proxies)
            html = etree.HTML(response.text)
            read_num = html.xpath('//*[@id="mainBox"]/main/div/div/div/div[2]/div/div/span[@class="read-count"]/text()')[0]
        except (requests.RequestException, ValueError, IndexError):
            # Request failed, the page could not be parsed, or the read counter
            # was not found: end this round
            break
        else:
            print(f'Status code: {response.status_code}, ', end='')
            if response.status_code == 200:
                print(f'{url} visited successfully, current read count: {read_num}, current ip: {proxies}')
            else:
                print(f'{url} visit failed')
        time.sleep(1)   # pause between articles
    time.sleep(10)      # pause between rounds
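The listing pages through the article API 20 entries at a time, so it needs to know how many pages to request. A minimal sketch of that arithmetic using ceiling division, which avoids requesting an extra empty page when the total is an exact multiple of the page size; the helper name page_count is hypothetical.

```python
def page_count(total_items, page_size=20):
    """Number of pages needed to cover total_items at page_size items per page."""
    if total_items <= 0:
        return 0
    return (total_items + page_size - 1) // page_size


for total in (0, 1, 20, 21, 40):
    print(total, page_count(total))
# 0 0
# 1 1
# 20 1
# 21 2
# 40 2
```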

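4. Access control with Selenium

Selenium is a tool for automated testing of websites, and crawlers often use it to fetch resources as well. It is much slower than requests, though, so whenever requests can fetch a resource directly there is little reason to use Selenium.

In many crawlers Selenium plays only a supporting role: it drives a real, visible browser, which makes it easier for the programmer to write and tune the crawling code.

With Selenium and requests it is easy to get the front-end source of a page. Selenium can also click buttons to navigate the browser, which allows visiting related resources or visiting in a loop (paging through results).

Once the resources are fetched, the next step is data processing: parse the HTML or JSON, extract the data we want, and process it further.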
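As a small illustration of the JSON side of that step, the sketch below parses a response shaped like the one the listings in this article read: a data object holding total and a list of entries with url and title. The sample payload and the helper name extract_articles are made up for illustration; only the field names come from the article's code.

```python
import json

# Invented sample payload; only the field names mirror the article's code
sample_response = '''
{"data": {"total": 2, "list": [
    {"url": "https://example.com/article/1", "title": "First post"},
    {"url": "https://example.com/article/2", "title": "Second post"}
]}}
'''


def extract_articles(text):
    """Return (total, [(url, title), ...]) from a JSON response body."""
    data = json.loads(text)["data"]
    articles = [(item["url"], item["title"]) for item in data["list"]]
    return data["total"], articles


total, articles = extract_articles(sample_response)
print(total)
for url, title in articles:
    print(url, title)
```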

This is a crawler that logs in automatically and then follows, likes, and comments on a blog's articles in bulk:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from lxml import etree
import random
import time
import re
import json
import requests

# Configure a headless browser
opt = Options()
opt.add_argument("--headless")
opt.add_argument("--disable-gpu")

# Open the browser; headless mode is optional
driver = webdriver.Chrome(options=opt)
# driver = webdriver.Chrome()

# Log in
url = "https://passport.csdn.net/login"
driver.get(url)
time.sleep(2)
driver.find_element(By.XPATH, "/html/body/div[2]/div/div[2]/div[2]/div[1]/div/div[1]/span[4]").click()
time.sleep(2)

# Enter your own account and password
id_number = input('Enter your csdn account: ')
password = input('Enter your csdn password: ')
driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[2]/div[1]/div/div[2]/div/div[1]/div/input').send_keys(f'{id_number}')
driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[2]/div[1]/div/div[2]/div/div[2]/div/input').send_keys(f'{password}')
time.sleep(2)
driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[2]/div[1]/div/div[2]/div/div[4]/button').click()
time.sleep(2)

# The target user's home page
user_url = input("Enter the link to the target blogger's home page: ")
driver.get(user_url)
time.sleep(2)

# Use re to extract user_id from user_url
user_id = re.split(r"/|\?", user_url)[3]
json_url = f'https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username={user_id}'

# Follow ('关注' is the text of the Follow link on the page)
try:
    driver.find_element(By.LINK_TEXT, '关注').click()
    print(f'Followed {user_id}')
    time.sleep(2)
except Exception:
    print(f'User {user_id} is already followed')

# Request the JSON payload
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
    'referer': user_url,
    # Replace with the Cookie header of your own logged-in CSDN session
    'cookie': '<your CSDN cookie>'
}
json_response = requests.get(json_url, headers=headers)
time.sleep(2)

article_info_lst = []
json_data = json.loads(json_response.text)
article_num = json_data['data']['total']
print(f'article_num={article_num}')

# 20 articles per page, rounded up
n = (article_num + 19) // 20
try:
    for i in range(n):
        json_url = f'https://blog.csdn.net/community/home-api/v1/get-business-list?page={i+1}&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username={user_id}'
        json_response = requests.get(json_url, headers=headers)
        json_data = json.loads(json_response.text)
        article_lst = json_data['data']['list']
        for article in article_lst:
            article_info_lst.append((article['url'], article['title']))
except Exception as e:
    print(e)

article_num = 0
# The daily comment limit is 10
for article_info in article_info_lst:
    article_num += 1
    driver.get(article_info[0])
    time.sleep(3)

    # Scroll the page
    js = 'window.scrollTo(0, 1000)'  # scroll down
    driver.execute_script(js)
    time.sleep(1)

    # Like the article. If it is already liked, skip it: an existing like
    # means it has been commented on as well.
    html_data = driver.page_source
    html_data = etree.HTML(html_data)
    flag = html_data.xpath('/html/body/div[3]/div/main/div[2]/div/div[2]/ul/li[1]/a/img[3]/@style')[0]
    if flag == 'display:none':
        print(f'Article {article_num}: {article_info[1]}, already liked')
        continue
    else:
        driver.find_element(By.XPATH, '/html/body/div[3]/div/main/div[2]/div/div[2]/ul/li[1]').click()

    # Comment
    content_lst = [
        'Such a detailed explanation, clear and easy to follow. A quality post, follow, like and comment from me!!!',
        'Thanks for the careful explanation, it all makes sense now. Much appreciated, follow, like and comment!!!',
        'An excellent post, thank you!!! Follow, like and comment delivered!!!',
        'Checking in for revision, let us keep going together!!! Thanks for the careful explanation',
        'I am studying this topic right now and this post helped me a lot, thank you!'
    ]
    # On your own articles there is no tip button, so the last item is the 4th;
    # on other people's articles the last item is the 5th.
    # driver.find_element(By.XPATH, '/html/body/div[3]/div/main/div[2]/div/div[2]/ul/li[4]').click()
    driver.find_element(By.XPATH, '/html/body/div[3]/div/main/div[2]/div/div[2]/ul/li[5]').click()
    time.sleep(1)
    driver.find_element(By.XPATH, '//*[@id="comment_content"]').send_keys(random.choice(content_lst))
    time.sleep(1)
    driver.find_element(By.XPATH, '//*[@id="commentform"]/div[2]/div[3]/div[4]/a/input').click()
    time.sleep(2)
    print(f'Article {article_num}: {article_info[1]}, follow, like and comment done')
