
Data Extraction with the bs4 (BeautifulSoup4) Module and CSS Selectors

news 2025/7/12 20:13:22

from bs4 import BeautifulSoup

Create the object; its type is <class 'bs4.BeautifulSoup'>

soup = BeautifulSoup(html_source, 'parser')  # the parser is e.g. 'lxml' or 'html.parser'

bs4 object types

(1) Tag: an HTML tag
print(soup.title, type(soup.title))
(2) NavigableString: the text inside a tag. Its type is <class 'bs4.element.NavigableString'>, and ordinary string methods work on it.
title = soup.title
# .string returns the text inside the tag
print(title.string, type(title.string))
(3) Comment
# An HTML comment has type <class 'bs4.element.Comment'>
html = '<b><!-- comment text --></b>'
soup2 = BeautifulSoup(html, 'lxml')
print(soup2.b.string, type(soup2.b.string))
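
The three object types can be checked side by side. This is a minimal sketch, assuming a made-up fragment (sample_html) and the built-in 'html.parser' instead of 'lxml' so that nothing besides bs4 has to be installed; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, not taken from the article.
sample_html = "<p><b>bold text</b><i><!-- a hidden note --></i></p>"

# 'html.parser' ships with Python; the article itself uses 'lxml'.
sample_soup = BeautifulSoup(sample_html, "html.parser")

bold_tag = sample_soup.b          # a Tag
bold_text = sample_soup.b.string  # a NavigableString
note = sample_soup.i.string       # a Comment, which is a NavigableString subclass

print(type(bold_tag), type(bold_text), type(note))
print(bold_text.upper())  # ordinary str methods work on a NavigableString
```

Because Comment subclasses NavigableString, code that must skip comments should test isinstance(value, Comment) explicitly.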

Traversing the document tree

# Parse the data
head_tag = soup.p  # soup.p returns the first <p> tag
# .contents returns a list of all direct child nodes
# print(head_tag.contents)

print(head_tag.children)  # an iterator over the same children; loop over it to read them
for head in head_tag.children:
    print(head)

Full source:

# 1. Import the module
from bs4 import BeautifulSoup

# HTML source
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""

# 2. Create the object  <class 'bs4.BeautifulSoup'>
soup = BeautifulSoup(html_doc, 'lxml')

# 3. Parse the data
head_tag = soup.p
# .contents returns a list of all direct child nodes
# print(head_tag.contents)
print(head_tag.children)  # an iterator; loop over it to read the children
for head in head_tag.children:
    print(head)
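
The difference between .contents and .children is easier to see when the parent mixes text and tags. The following is a rough sketch under stated assumptions: fragment is a hypothetical snippet, and 'html.parser' stands in for 'lxml' to avoid the extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment used only for this sketch.
fragment = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
story_tag = BeautifulSoup(fragment, "html.parser").p

# .contents is a real list, so it supports len() and indexing.
child_list = story_tag.contents

# .children walks the same direct children lazily; text nodes are included,
# so filter on Tag when only elements are wanted.
child_tags = [child for child in story_tag.children if isinstance(child, Tag)]
link_ids = [tag["id"] for tag in child_tags]

print(len(child_list))
print(link_ids)
```

Both attributes cover direct children only; text nested inside the <a> tags is not listed separately.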

Getting a node's text content

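# Reach a child tag's text through its parent tag
# head = soup.head
# print(head.string)

# print(head.text)  # concatenates the text of all descendant tags into one string

# strings / stripped_strings
contents = soup.html
# print(contents.string)  # None: <html> has more than one child, so there is no single string
# print(contents.text)
# print(contents.strings)  # <generator object Tag._all_strings at 0x000001E214912820>, a generator
# .strings yields every piece of text under the tag, including whitespace-only strings (blank lines)
# for data in contents.strings:
#     print(data)

# .stripped_strings yields the same text with whitespace stripped and blank strings skipped
# for data in contents.stripped_strings:
#     print(data)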

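Summary:

Ways to get a tag's text content
string: returns a NavigableString when the tag holds exactly one string, either directly or through a chain of only children; otherwise returns None
text: concatenates the text of every descendant tag into one string
strings: yields each piece of text in turn, whitespace-only strings included; returns a generator
stripped_strings: yields each piece of text in turn, with surrounding whitespace stripped and blank strings skipped; returns a generator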
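
The four accessors in the summary can be compared on one small page. This is an illustrative sketch, not the article's own code: the page string and variable names are assumptions, and 'html.parser' is used in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for this comparison.
page = """<html><head><title>Demo title</title></head>
<body>
<p>First paragraph</p>
<p>Second paragraph</p>
</body></html>"""
demo_soup = BeautifulSoup(page, "html.parser")

single = demo_soup.head.string   # <head> has one child, <title>, so its string is returned
missing = demo_soup.body.string  # <body> has several children, so this is None
joined = demo_soup.body.text     # all text under <body>, concatenated

raw_pieces = list(demo_soup.body.strings)             # newline-only strings are kept
clean_pieces = list(demo_soup.body.stripped_strings)  # blank strings are dropped

print(repr(single), missing)
print(repr(joined))
print(raw_pieces)
print(clean_pieces)
```

For scraped data, stripped_strings is usually the most convenient of the four, since the whitespace between tags rarely carries meaning.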


http://www.xdnf.cn/news/1096453.html