
Python: attaching Selenium to a browser opened on a specified debugging port

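When scraping data with Selenium, the usual flow is to let Selenium drive everything, starting from launching the browser. Sometimes, though, we want the user to open the browser first, navigate to the target site, and complete login by hand (username, password, SMS code, and the various image CAPTCHAs that are hard to automate), and only then have Selenium take over from the logged-in page and scrape the data. How can the manual and the automated steps be joined together?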

Standard approach

The standard approach configures the browser options and then calls get directly to open the page.

from selenium import webdriver


class DriverClass:
    def __init__(self):
        self.driver = self._init_driver()

    def _init_driver(self):
        option = webdriver.ChromeOptions()
        # Hide the "controlled by automated software" banner and automation extension
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_experimental_option('useAutomationExtension', False)
        # Turn off the save-password prompts
        prefs = dict()
        prefs['credentials_enable_service'] = False
        prefs['profile.password_manager_enabled'] = False
        prefs['profile.name'] = "Person 1"
        option.add_experimental_option('prefs', prefs)
        option.add_argument('--disable-gpu')
        option.add_argument("--disable-blink-features=AutomationControlled")
        option.add_argument('--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"')
        option.add_argument('--no-sandbox')
        option.add_argument('ignore-certificate-errors')
        # Selenium 3 style call; on Selenium 4.10+ pass service=Service(path) instead
        driver = webdriver.Chrome(r"./driver/chromedriver.exe", options=option)
        driver.implicitly_wait(2)
        driver.maximize_window()
        return driver

    def get_driver(self) -> webdriver.Chrome:
        if isinstance(self.driver, webdriver.Chrome):
            return self.driver
        raise Exception('Failed to initialise the browser')


if __name__ == '__main__':
    dc = DriverClass()
    driver = dc.get_driver()
    print(driver)
    driver.get("https://www.baidu.com")

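Attaching to an existing browser

Attaching works by having both sides use the same port: the user launches the browser with a remote debugging port, and Selenium is pointed at that same port (otherwise Selenium has no way of knowing which browser to attach to).

The user opens the browser

When opening the browser manually, the user specifies the debugging port (9527 here) and a user data directory (any directory of your choosing).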

C:\Program Files\Google\Chrome\Application>chrome.exe --remote-debugging-port=9527 --user-data-dir="E:\lky_project\tmp_project\handle_qcc_data\chrome_user_data"

Running the command above opens a new browser window.
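The same launch can also be started from Python rather than from a command prompt. The following is only a sketch: build_chrome_command and launch_chrome are hypothetical helper names, and the Chrome path and data directory must be adjusted to the local machine.

```python
import subprocess


def build_chrome_command(chrome_path, port, user_data_dir):
    """Return the argv list for starting Chrome with remote debugging enabled."""
    return [
        chrome_path,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",
    ]


def launch_chrome(chrome_path, port, user_data_dir):
    """Start Chrome as a separate process and return the Popen handle."""
    return subprocess.Popen(build_chrome_command(chrome_path, port, user_data_dir))


# Example call (paths are assumptions, not verified on any machine):
# launch_chrome(r"C:\Program Files\Google\Chrome\Application\chrome.exe",
#               9527, r"E:\chrome_user_data")
```

Passing the arguments as a list avoids quoting problems with the space in "Program Files".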

Once the browser is open, the user can navigate to the target page and complete login and any other verification by hand.
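Before handing over to Selenium, it can be useful to confirm that the debugging port is actually listening. A minimal sketch, assuming Chrome's DevTools HTTP endpoint /json/version on the chosen port; the helper name debugger_info is hypothetical and not part of the original article.

```python
import http.client
import json
import urllib.request


def debugger_info(host="127.0.0.1", port=9527, timeout=2.0):
    """Return Chrome's /json/version payload as a dict, or None if unreachable."""
    url = f"http://{host}:{port}/json/version"
    # Bypass any system proxy: the debugging port is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError covers a response body that is not valid JSON.
        return None
```

If the function returns None, the browser was most likely started without --remote-debugging-port or on a different port, and attaching would fail.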

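The program attaches to the browser

Selenium attaches by adding the following option:

option.add_experimental_option("debuggerAddress", "127.0.0.1:9527")

With this option set, Selenium does not launch a new browser; it connects to the browser the user already opened on that port. The program can then carry on with the remaining tasks through the returned driver handle.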

driver_class.py

from selenium import webdriver


class DriverClass:
    def __init__(self):
        self.driver = self._init_driver()

    def _init_driver(self):
        option = webdriver.ChromeOptions()
        # Launch-time options are disabled when attaching (see the notes below)
        # option.add_experimental_option('excludeSwitches', ['enable-automation'])
        # option.add_experimental_option('useAutomationExtension', False)
        # prefs = dict()
        # prefs['credentials_enable_service'] = False
        # prefs['profile.password_manager_enabled'] = False
        # prefs['profile.name'] = "Person 1"
        # option.add_experimental_option('prefs', prefs)
        option.add_argument('--disable-gpu')
        option.add_argument("--disable-blink-features=AutomationControlled")
        option.add_argument('--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"')
        option.add_argument('--no-sandbox')
        option.add_argument('ignore-certificate-errors')
        # Attach to the browser the user opened on port 9527
        option.add_experimental_option("debuggerAddress", "127.0.0.1:9527")
        # Selenium 3 style call; on Selenium 4.10+ pass service=Service(path) instead
        driver = webdriver.Chrome(r"./driver/chromedriver.exe", options=option)
        driver.implicitly_wait(2)
        # driver.maximize_window()
        return driver

    def get_driver(self) -> webdriver.Chrome:
        if isinstance(self.driver, webdriver.Chrome):
            return self.driver
        raise Exception('Failed to initialise the browser')


if __name__ == '__main__':
    dc = DriverClass()
    driver = dc.get_driver()
    print(driver)
    # From here on, the program uses the attached driver handle for the remaining work

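Notes

Note that in the attach version above, some of the option settings are commented out. Attaching means taking over a browser that is already running; those options were fixed when the user launched the browser, so setting them again while attaching is not supported.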
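One way to keep both modes in a single code path is to choose the experimental options based on whether a debugger address is given. This is a sketch only: experimental_options and build_options are hypothetical names, and the launch-time set shown is a subset of the options used in the standard approach.

```python
def experimental_options(debugger_address=None):
    """Return the experimental options to apply, keyed by option name.

    With a debugger address we attach to a running Chrome, so only
    "debuggerAddress" is set. Without one we launch a fresh Chrome and
    can apply the launch-time options.
    """
    if debugger_address:
        return {"debuggerAddress": debugger_address}
    return {
        "excludeSwitches": ["enable-automation"],
        "useAutomationExtension": False,
        "prefs": {"credentials_enable_service": False},
    }


def build_options(debugger_address=None):
    """Build ChromeOptions for either mode (requires the selenium package)."""
    # Imported lazily so experimental_options() has no third-party dependency.
    from selenium import webdriver

    option = webdriver.ChromeOptions()
    for name, value in experimental_options(debugger_address).items():
        option.add_experimental_option(name, value)
    return option
```

Calling build_options("127.0.0.1:9527") gives the attach configuration, while build_options() gives the launch configuration.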

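Worked example

For example, after manually opening a browser on port 9527, log in to Qichacha (企查查) and go to the advanced search page. The program then fetches the number of companies holding each qualification (operating too frequently may trigger a verification check) and finally writes the results to data.json. Because the run may be interrupted partway through, the script is written to resume from where it left off.

driver_class.py is the same as above.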

main.py

import json
import re
import time

from selenium.webdriver.common.by import By

from driver_class import DriverClass

dc = DriverClass()
driver = dc.get_driver()
# The XPath text literals below must stay in Chinese: they match the page content
xpath_prefix = '//div/div/div/div/span[text()="资质证书"]/following-sibling::div'


def checkbox_select(element_checkbox):
    """Tick the checkbox if it is not already ticked."""
    class_attribute = element_checkbox.get_attribute("class")
    if "checked" not in class_attribute:
        element_checkbox.find_element(By.XPATH, './span[@class="qccd-tree-checkbox-inner"]').click()


def checkbox_unselect(element_checkbox):
    """Untick the checkbox if it is currently ticked."""
    class_attribute = element_checkbox.get_attribute("class")
    if "checked" in class_attribute:
        element_checkbox.find_element(By.XPATH, './span[@class="qccd-tree-checkbox-inner"]').click()


def get_amount(element_checkbox):
    """Return the number of companies matching the given checkbox."""
    checkbox_select(element_checkbox)
    xpath_confirm = xpath_prefix + '/div/div/div/div/div[text()="确定"]'
    driver.find_element(By.XPATH, xpath_confirm).click()
    time.sleep(0.5)
    try:
        xpath_result = '//div/div/div[@class="search-btn limit-svip"]'
        result = str(driver.find_element(By.XPATH, xpath_result).text)
    except Exception as e:
        print(f"Error: {e}")
        result = "0"
    result = result.replace(",", "")
    match_object = re.search(r"(\d+)", result)
    # Fall back to "0" when the button text contains no digits
    amount = match_object.group(1) if match_object else "0"
    print(f"Count: {amount}")
    # Clear the result so the next click on the selector does not hit "close" by mistake
    xpath_clear = '//div/div/a[contains(text(), "清除")]'
    try:
        driver.find_element(By.XPATH, xpath_clear).click()
    except Exception:
        pass
    xpath_select = xpath_prefix + '[@class="trigger-container"]'
    driver.find_element(By.XPATH, xpath_select).click()
    time.sleep(0.2)
    checkbox_unselect(element_checkbox)
    return amount


def extend_options():
    """Expand the collapsed tree nodes and collect the data (three levels only)."""
    # Load earlier results so an interrupted run can resume
    try:
        with open("data.json", encoding="utf-8") as f:
            data = json.load(f)
    except Exception:
        data = {}
    try:
        xpath_first_class = xpath_prefix + '//div/ul/li[@role="treeitem"]'
        # xpath_first_class = xpath_prefix + '//div/ul/li/span[contains(@class, "qccd-tree-switcher")]'
        first_item_list = driver.find_elements(By.XPATH, xpath_first_class)
        for item_li in first_item_list:
            text_dk1 = item_li.find_element(By.XPATH, './span/span/div/span[@class="text-dk"]').text
            data[text_dk1] = data.get(text_dk1, {})
            print(f"{text_dk1}")
            switcher = item_li.find_element(By.XPATH, './span[contains(@class, "qccd-tree-switcher")]')
            class_attribute = switcher.get_attribute("class")
            element_checkbox = item_li.find_element(By.XPATH, './span[contains(@class, "checkbox")]')
            if "close" in class_attribute:
                switcher.click()
                time.sleep(0.1)
            elif "noop" in class_attribute:
                # This node has no children
                if not data[text_dk1]:
                    amount = get_amount(element_checkbox)
                    data[text_dk1] = amount
                continue
            # After expanding, the next-level ul/li elements become visible
            second_item_list = item_li.find_elements(By.XPATH, "./ul/li")
            for second_item_li in second_item_list:
                text_dk2 = second_item_li.find_element(By.XPATH, './span/span/div/span[@class="text-dk"]').text
                data[text_dk1][text_dk2] = data[text_dk1].get(text_dk2, {})
                print(f"--{text_dk2}")
                switcher = second_item_li.find_element(By.XPATH, './span[contains(@class, "qccd-tree-switcher")]')
                class_attribute = switcher.get_attribute("class")
                element_checkbox = second_item_li.find_element(By.XPATH, './span[contains(@class, "checkbox")]')
                if "close" in class_attribute:
                    switcher.click()
                    time.sleep(0.1)
                elif "noop" in class_attribute:
                    # This node has no children
                    if not data[text_dk1][text_dk2]:
                        amount = get_amount(element_checkbox)
                        data[text_dk1][text_dk2] = amount
                    continue
                # After expanding, the next-level ul/li elements become visible
                third_item_list = second_item_li.find_elements(By.XPATH, "./ul/li")
                for third_item_li in third_item_list:
                    text_dk3 = third_item_li.find_element(By.XPATH, './span/span/div/span[@class="text-dk"]').text
                    data[text_dk1][text_dk2][text_dk3] = data[text_dk1][text_dk2].get(text_dk3, {})
                    print(f"----{text_dk3}")
                    # At the third level, do not expand further; tick the checkbox directly
                    element_checkbox = third_item_li.find_element(By.XPATH, './span[contains(@class, "checkbox")]')
                    if not data[text_dk1][text_dk2][text_dk3]:
                        amount = get_amount(element_checkbox)
                        data[text_dk1][text_dk2][text_dk3] = amount
    finally:
        # Always save what has been collected, even if the run was interrupted
        with open("data.json", 'w', encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)


def spider_data():
    # Try to close the certificate selector and clear any selected items
    xpath_close = xpath_prefix + '/div/div/div/a[@class="nclose"]'
    xpath_clear = '//div/div/a[contains(text(), "清除")]'
    try:
        driver.find_element(By.XPATH, xpath_close).click()
    except Exception:
        pass
    try:
        driver.find_element(By.XPATH, xpath_clear).click()
    except Exception:
        pass
    # Click the certificate selector
    xpath_select = xpath_prefix + '[@class="trigger-container"]'
    driver.find_element(By.XPATH, xpath_select).click()
    time.sleep(2)
    extend_options()
    # Cancel button (not used here)
    xpath_cancel = xpath_prefix + '/div/div/div/div/div[text()="取消"]'
    # Confirm button
    xpath_confirm = xpath_prefix + '/div/div/div/div/div[text()="确定"]'
    driver.find_element(By.XPATH, xpath_confirm).click()


if __name__ == '__main__':
    spider_data()
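Stripped of the Selenium calls, the resume logic in extend_options comes down to three steps: load what was saved, fetch only the missing entries, and save in a finally block. A minimal stdlib-only sketch of that pattern follows; the function names are hypothetical, and fetch stands in for get_amount.

```python
import json
import os


def load_checkpoint(path):
    """Load earlier results, or start empty if the file is missing or corrupt."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}


def save_checkpoint(path, data):
    """Write to a temporary file first so a crash cannot leave a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    os.replace(tmp_path, path)


def run_with_checkpoint(path, keys, fetch):
    """Call fetch(key) only for keys with no saved value; save even on failure."""
    data = load_checkpoint(path)
    try:
        for key in keys:
            if not data.get(key):
                data[key] = fetch(key)
    finally:
        save_checkpoint(path, data)
    return data
```

Writing through a temporary file and os.replace is an addition to what the article does; it guards against the file being truncated if the process dies during the write.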

The resulting data.json looks like this:

{
  "建筑业资质": {
    "工程设计资质证书": {
      "工程设计专项资质": "26329",
      "建筑工程设计事务所": "356",
      "工程设计行业资质": "4487",
      "工程设计专业资质": "19902",
      "工程设计综合资质": "98"
    },
    "工程勘察资质证书": {
      "工程勘察综合资质": "377",
      "工程勘察专业资质": "7464",
      "工程勘察劳务资质": "3019"
    },
    ...
  },
  "食品农产品认证": {
    "有机产品(OGA)": "49868",
    "良好农业规范(GAP)": "6449",
    "食品质量认证(酒类)": "151",
    "绿色食品认证": "34723",
    "绿色市场认证": "318",
    "无公害农产品": "31067",
    "食品安全管理体系认证": "72075",
    "危害分析与关键控制点认证": "51844",
    "乳制品生产企业良好生产规范认证": "445",
    "乳制品生产企业危害分析与关键控制点(HACCP)体系认证": "570",
    "饲料产品": "85"
  },
  "其他资质": {
    "办学许可证": "192010",
    "代理记账许可证书": "34588",
    "会计师事务所执业证书": "12252",
    "DOC证书": "982",
    "SMC证书": "1886",
    "名特优新农产品证书": "1818",
    "招投标类综合资质": "36317",
    "区块链信息服务备案": "2765",
    "医疗机构执业许可证": "570877",
    "CCC工厂认证": "16154",
    "卫生许可证": "3244"
  }
}
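Since the counts are stored as strings in a nested dict, a small helper can flatten them for further analysis. This is a sketch with hypothetical function names; note that summing counts per category may count a company more than once if it holds several certificates.

```python
def flatten_counts(node, path=()):
    """Yield (path, count) pairs for every leaf in the nested result dict."""
    for name, value in node.items():
        if isinstance(value, dict):
            yield from flatten_counts(value, path + (name,))
        else:
            yield path + (name,), int(value)


def total_per_category(data):
    """Sum the leaf counts under each top-level category."""
    totals = {}
    for path, count in flatten_counts(data):
        totals[path[0]] = totals.get(path[0], 0) + count
    return totals
```

Entries that are still empty dicts (not yet fetched in an interrupted run) contribute nothing to the totals.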