
Automating Word Document Processing with Python: Paragraph Style Copying, Table Conversion, and Template Generation

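I. Introduction

As demand for office automation grows, the python-docx library lets Python read and manipulate Word documents in depth. This article shows how to copy paragraph styles, convert HTML tables to Word tables, and generate customizable templates in code.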


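II. Core Modules

1. Copying Paragraph Styles and Images

import io

def copy_inline_shapes(new_doc, img):
    """Insert previously extracted inline images into a new paragraph of new_doc."""
    new_para = new_doc.add_paragraph()
    for image_bytes, w, h in img:
        # w and h are the sizes stored with each image when it was extracted
        new_para.add_run().add_picture(io.BytesIO(image_bytes), width=w, height=h)

  • What it does: inserts images extracted from the source document into the new document; width and height are set per image.
  • When to use it: documents that mix text and images and need to keep their original formatting.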

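2. Converting HTML Tables to Word Tables

def html_table_to_docx(doc, html_content):
    # Convert each HTML table into a Word table, including merged cells
    ...

  • What it does: converts a parsed HTML table structure into a table in the Word document, with support for horizontal and vertical merges.
  • Key points
    • Parse the HTML with BeautifulSoup
    • Handle cell styles, borders, and background colors
    • Support style inheritance for multi-level headings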
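
The merged-cell handling described above comes down to tracking which grid positions are already occupied by an earlier colspan or rowspan. The following is a minimal, library-free sketch of that bookkeeping. The function name place_cells and its (text, colspan, rowspan) tuple input are hypothetical, chosen only for illustration; they are not part of python-docx or BeautifulSoup.

```python
def place_cells(rows):
    """Assign each cell an anchor position in a rectangular grid.

    rows: list of rows, each a list of (text, colspan, rowspan) tuples.
    Returns (n_cols, placements), where each placement is
    (row, col, rowspan, colspan, text).
    """
    n_rows = len(rows)
    # Column count is estimated from the widest row, counting colspans.
    n_cols = max(sum(colspan for _, colspan, _ in row) for row in rows)
    used = [[False] * n_cols for _ in range(n_rows)]
    placements = []
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip positions already covered by a span from an earlier cell.
            while c < n_cols and used[r][c]:
                c += 1
            if c >= n_cols:
                break
            # Mark the whole rectangle this cell covers, clipped to the grid.
            for rr in range(r, min(r + rowspan, n_rows)):
                for cc in range(c, min(c + colspan, n_cols)):
                    used[rr][cc] = True
            placements.append((r, c, rowspan, colspan, text))
            c += colspan
    return n_cols, placements


if __name__ == "__main__":
    demo = [
        [("A", 2, 1), ("B", 1, 2)],   # A spans two columns, B spans two rows
        [("C", 1, 1), ("D", 1, 1)],
    ]
    n_cols, cells = place_cells(demo)
    print(n_cols)
    for cell in cells:
        print(cell)
```

Each placement gives the top-left anchor and the span of one cell, which is the information needed to pick the two corner cells of a merge in the Word table.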

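3. Template Generation and Dynamic Styles

from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH

def generate_template():
    doc = Document()
    for align in [WD_ALIGN_PARAGRAPH.LEFT, WD_ALIGN_PARAGRAPH.RIGHT, WD_ALIGN_PARAGRAPH.CENTER, None]:
        for bold_flag in [True, False]:
            # Create a paragraph for each style with this alignment and bold setting
            ...

  • What it does: dynamically generates a template document covering several alignments (left, right, center, unset).
  • Advantage: new styles can be added quickly to fit different scenarios.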

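III. Example Code

Example 1: Copying Paragraph Styles and Images

def clone_document(old_s, old_p, old_ws, new_doc_path):
    new_doc = Document()
    # Image sets in document order, one per paragraph that contained images
    image_sets = iter(s["image"] for s in old_s if "image" in s)
    for para in old_p:
        for k, v in para.items():
            if k == "Image_None":
                copy_inline_shapes(new_doc, next(image_sets))
            elif k == "table":
                html_table_to_docx(new_doc, v)
            else:
                # k is the run text, v is the "<style name>_<alignment>" key
                style = next(i for i in old_s if v in i["style"])
                style_ws = next(i for i in old_ws if v in i["style"])
                clone_paragraph(style, k, new_doc, style_ws)
    new_doc.save(new_doc_path)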

Example 2: Converting an HTML Table to Word

def html_table_to_docx(doc, html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    tables = soup.find_all('table')
    for table in tables:
        # Handle merged cells and convert styles...
        ...

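IV. Key Implementation Details

1. Style Copying Strategy

  • Inheritance: font, alignment, and related attributes are passed along through the run_style and style fields.
  • Page breaks: is_page_break determines whether a page break is needed after a paragraph or table.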
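
To make the page-break check concrete: in WordprocessingML a hard page break is a w:br element with w:type="page", and it sits inside a run (w:r), not directly under the paragraph. The sketch below illustrates the check on raw XML using only the standard library. The helper name has_page_break and the sample XML strings are assumptions made for this illustration; they are separate from the is_page_break function in the listings.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def has_page_break(paragraph_xml: str) -> bool:
    """Return True if the w:p element contains <w:br w:type="page"/> anywhere."""
    p = ET.fromstring(paragraph_xml)
    # iter() walks all descendants, so breaks nested inside runs are found.
    return any(
        br.get(f"{{{W_NS}}}type") == "page"
        for br in p.iter(f"{{{W_NS}}}br")
    )


# Hand-written sample paragraphs (hypothetical input, not taken from a real file).
PARA_WITH_BREAK = f'<w:p xmlns:w="{W_NS}"><w:r><w:br w:type="page"/></w:r></w:p>'
PARA_LINE_BREAK = f'<w:p xmlns:w="{W_NS}"><w:r><w:br/></w:r></w:p>'
PARA_PLAIN = f'<w:p xmlns:w="{W_NS}"><w:r><w:t>text</w:t></w:r></w:p>'

if __name__ == "__main__":
    for xml in (PARA_WITH_BREAK, PARA_LINE_BREAK, PARA_PLAIN):
        print(has_page_break(xml))
```

A check that only inspects the paragraph's direct children would miss the break, because the w:br is one level deeper.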

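2. Table Conversion

  • Merged-cell detection: horizontal and vertical merges are identified from the tcPr element (gridSpan and vMerge).
  • Style migration: borders, background colors, and other visual attributes are preserved.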
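
As a rough illustration of merged-cell detection, the sketch below reads w:gridSpan and w:vMerge from each cell's w:tcPr using only the standard library. The function name describe_cells and the SAMPLE table XML are hypothetical. One detail worth noting is that a w:vMerge element without a w:val attribute means "continue", so the sketch applies that default.

```python
import xml.etree.ElementTree as ET

NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{NS}}}"


def describe_cells(tbl_xml: str):
    """Return, per row, a list of (grid_span, v_merge) for each w:tc.

    v_merge is 'restart', 'continue', or None when the cell is not merged.
    """
    tbl = ET.fromstring(tbl_xml.strip())
    result = []
    for tr in tbl.findall(f"{W}tr"):
        row = []
        for tc in tr.findall(f"{W}tc"):
            span, merge = 1, None
            tc_pr = tc.find(f"{W}tcPr")
            if tc_pr is not None:
                grid_span = tc_pr.find(f"{W}gridSpan")
                if grid_span is not None:
                    span = int(grid_span.get(f"{W}val", "1"))
                v_merge = tc_pr.find(f"{W}vMerge")
                if v_merge is not None:
                    # A missing w:val is treated as 'continue'.
                    merge = v_merge.get(f"{W}val", "continue")
            row.append((span, merge))
        result.append(row)
    return result


# Hand-written sample: first cell spans two columns, last column is merged vertically.
SAMPLE = f"""
<w:tbl xmlns:w="{NS}">
  <w:tr>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr></w:tc>
    <w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr></w:tc>
  </w:tr>
  <w:tr>
    <w:tc/>
    <w:tc/>
    <w:tc><w:tcPr><w:vMerge/></w:tcPr></w:tc>
  </w:tr>
</w:tbl>
"""

if __name__ == "__main__":
    for row in describe_cells(SAMPLE):
        print(row)
```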

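3. Dynamic Template Generation

  • Multiple styles: the template is built by iterating over every paragraph style, so it is easy to extend.
  • Flexible configuration: users can customize page-break positions and style parameters.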
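
The listings in this article look up template styles by a key made of the style name and the alignment, joined with an underscore. A small sketch of how such keys could be enumerated and checked is shown below. It assumes the alignment part is the enum member's name (or "None" when unset); template_keys and missing_keys are hypothetical helpers, not functions from the article's code.

```python
from itertools import product

# Alignment names assumed to match the four variants the template covers.
ALIGNMENT_NAMES = ["LEFT", "RIGHT", "CENTER", "None"]


def template_keys(style_names, alignment_names=ALIGNMENT_NAMES):
    """Return every "<style name>_<alignment>" key a template would provide."""
    return [f"{style}_{align}" for style, align in product(style_names, alignment_names)]


def missing_keys(needed, available):
    """List the keys a source document needs that the template lacks."""
    available = set(available)
    return [key for key in needed if key not in available]


if __name__ == "__main__":
    keys = template_keys(["Normal", "Heading 1"])
    print(keys)
    print(missing_keys(["Title_LEFT", "Normal_None"], keys))
```

Checking for missing keys before cloning makes it clear when a source document uses a style and alignment combination that the template does not cover.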

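V. Use Cases

Scenario                      Solution
Paragraph layout              Copy styles automatically and preserve formatting
Data table export             Convert HTML to Word tables, with merged-cell support
Report template generation    Dynamically create template files containing multiple styles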

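VI. Summary

With the python-docx library we implemented a complete workflow from style copying to table conversion. Dynamic template generation adds further flexibility to document processing. Whether the task is complex text-and-image layout or quickly producing documents in several styles, this approach offers an efficient path to an implementation.

Suggestion: in practice, finer control is possible by working with python-docx's Document object and iterating over all of its elements. Catching exceptional cases, such as malformed image data, is also an important part of making the code robust.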
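
One way to guard against bad image data, sketched here under stated assumptions, is to check the leading bytes of each image before handing it to the document writer. The helpers sniff_image_format and safe_images are hypothetical names, and the signature table covers only PNG, JPEG, and GIF; it is a pre-filter, not a full validation of the file.

```python
# Leading byte signatures for a few common image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_format(data: bytes):
    """Return a format name if the bytes start with a known signature, else None."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def safe_images(images):
    """Keep only (image_bytes, width, height) entries whose bytes look like an image."""
    return [entry for entry in images if sniff_image_format(entry[0]) is not None]


if __name__ == "__main__":
    candidates = [
        (b"\x89PNG\r\n\x1a\n" + b"\x00" * 8, 100, 50),
        (b"not an image", 100, 50),
    ]
    print(len(safe_images(candidates)))
```

Entries that fail the check can be logged and skipped, so one corrupt image does not abort the whole document.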

Generating a Document from Template Styles

import io

from docx import Document
from docx.enum.text import WD_BREAK
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
# Local module, assumed to hold the style/text separation code from the next section
from wan_neng_copy_word import clone_document as get_para_style, html_table_to_docx


def copy_inline_shapes(new_doc, img):
    """Insert previously extracted inline images into a new paragraph of new_doc."""
    new_para = new_doc.add_paragraph()
    for image_bytes, w, h in img:
        # w and h are the sizes stored with each image when it was extracted
        new_para.add_run().add_picture(io.BytesIO(image_bytes), width=w, height=h)


def copy_paragraph_style(run_from, run_to):
    """Copy character formatting from one run to another."""
    run_to.bold = run_from.bold
    run_to.italic = run_from.italic
    run_to.underline = run_from.underline
    run_to.font.size = run_from.font.size
    run_to.font.color.rgb = run_from.font.color.rgb
    run_to.font.name = run_from.font.name
    run_to.font.all_caps = run_from.font.all_caps
    run_to.font.strike = run_from.font.strike
    run_to.font.shadow = run_from.font.shadow


def is_page_break(element):
    """Return True if a paragraph contains a page break, or a table is directly followed by one."""
    def has_page_break(p):
        # w:br sits inside a run (w:r), so search descendants, not direct children
        return any(br.get(qn('w:type')) == 'page' for br in p.iter(qn('w:br')))

    if element.tag == qn('w:p'):
        return has_page_break(element)
    if element.tag == qn('w:tbl'):
        # A page break after a table lives in the paragraph that follows it
        next_element = element.getnext()
        return (next_element is not None
                and next_element.tag == qn('w:p')
                and has_page_break(next_element))
    return False


def clone_paragraph(para_style, text, new_doc, para_style_ws):
    """Create a paragraph in new_doc from extracted style data and the matching template style."""
    new_para = new_doc.add_paragraph()
    template_style = list(para_style_ws["style"].values())[0]
    source_style = list(para_style["style"].values())[0]
    template_style.font.size = source_style.font.size
    new_para.style = template_style
    new_run = new_para.add_run(text)
    copy_paragraph_style(para_style["run_style"][0], new_run)
    new_para.alignment = list(para_style["alignment"].values())[0]
    return new_para


def copy_cell_borders(old_cell, new_cell):
    """Copy a cell's border settings."""
    old_tc = old_cell._tc
    new_tc = new_cell._tc
    old_borders = old_tc.xpath('.//w:tcBorders')
    if old_borders:
        old_border = old_borders[0]
        new_border = OxmlElement('w:tcBorders')
        border_types = ['top', 'left', 'bottom', 'right', 'insideH', 'insideV']
        for border_type in border_types:
            old_element = old_border.find(qn(f'w:{border_type}'))
            if old_element is not None:
                new_element = OxmlElement(f'w:{border_type}')
                for attr, value in old_element.attrib.items():
                    new_element.set(attr, value)
                new_border.append(new_element)
        tc_pr = new_tc.get_or_add_tcPr()
        tc_pr.append(new_border)


def clone_table(old_table, new_doc):
    """Create a new table in new_doc from an existing table."""
    new_table = new_doc.add_table(rows=len(old_table.rows), cols=len(old_table.columns))
    if old_table.style:
        new_table.style = old_table.style
    for i, old_row in enumerate(old_table.rows):
        for j, old_cell in enumerate(old_row.cells):
            new_cell = new_table.cell(i, j)
            # Remove the default empty paragraph before copying content
            for paragraph in new_cell.paragraphs:
                new_cell._element.remove(paragraph._element)
            for old_paragraph in old_cell.paragraphs:
                new_paragraph = new_cell.add_paragraph()
                for old_run in old_paragraph.runs:
                    new_run = new_paragraph.add_run(old_run.text)
                    copy_paragraph_style(old_run, new_run)
                new_paragraph.alignment = old_paragraph.alignment
            copy_cell_borders(old_cell, new_cell)
    for i, col in enumerate(old_table.columns):
        if col.width is not None:
            new_table.columns[i].width = col.width
    return new_table


def clone_document(old_s, old_p, old_ws, new_doc_path):
    new_doc = Document()
    # Image sets in document order, one per paragraph that contained images
    image_sets = iter(s["image"] for s in old_s if "image" in s)
    # Copy the body content
    for para in old_p:
        if para == "br":
            # Page-break marker emitted by the extraction step
            new_doc.add_paragraph().add_run().add_break(WD_BREAK.PAGE)
            continue
        for k, v in para.items():
            if k == "Image_None":
                copy_inline_shapes(new_doc, next(image_sets))
            elif k == "table":
                html_table_to_docx(new_doc, v)
            else:
                # k is the run text, v is the "<style name>_<alignment>" key
                style = [i for i in old_s if v in i["style"] and i["run_style"]]
                style_ws = [i for i in old_ws if v in i["style"]]
                clone_paragraph(style[0], k, new_doc, style_ws[0])
    new_doc.save(new_doc_path)


# Usage example
if __name__ == "__main__":
    body_ws, _ = get_para_style('demo_template.docx')
    body_s, body_p = get_para_style("南山三防工作专报1.docx")
    clone_document(body_s, body_p, body_ws, 'cloned_example.docx')

Separating Template Styles from Text

import re

from bs4 import BeautifulSoup
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.table import Table
from docx.text.paragraph import Paragraph


def _tc_prop(tc, tag):
    """Return the child `tag` of a cell's w:tcPr, or None."""
    tc_pr = tc.find(qn('w:tcPr'))
    return None if tc_pr is None else tc_pr.find(qn(tag))


def _grid_span(tc):
    """Return the number of grid columns a w:tc element spans."""
    span = _tc_prop(tc, 'w:gridSpan')
    return 1 if span is None else int(span.get(qn('w:val'), '1'))


def _v_merge(tc):
    """Return 'restart', 'continue', or None for a w:tc element."""
    v_merge = _tc_prop(tc, 'w:vMerge')
    # A w:vMerge element without w:val means 'continue'
    return None if v_merge is None else v_merge.get(qn('w:val'), 'continue')


def _cell_text(tc):
    """Join the text of each paragraph in a w:tc element, one line per paragraph."""
    return '\n'.join(
        ''.join(t.text or '' for t in p.iter(qn('w:t')))
        for p in tc.findall(qn('w:p'))
    )


def docx_table_to_html(word_table):
    """Convert a python-docx table to an HTML table string, keeping merges and basic styles."""
    soup = BeautifulSoup(features='html.parser')
    html_table = soup.new_tag('table', style="border-collapse: collapse;")

    # Map each row to {grid column: w:tc}. The raw XML is used because
    # row.cells can return the same merged cell more than once.
    grid = []
    for tr in word_table._tbl.findall(qn('w:tr')):
        row_map, col_idx = {}, 0
        for tc in tr.findall(qn('w:tc')):
            row_map[col_idx] = tc
            col_idx += _grid_span(tc)
        grid.append(row_map)

    # Word border names mapped to CSS border styles
    border_styles = {'single': 'solid', 'double': 'double', 'dashed': 'dashed',
                     'dotted': 'dotted', 'nil': 'none', 'none': 'none'}

    for row_idx, row_map in enumerate(grid):
        html_tr = soup.new_tag('tr')
        for col_idx, tc in row_map.items():
            merge = _v_merge(tc)
            # Skip cells that continue a vertical merge; the rowspan above covers them
            if merge == 'continue':
                continue
            td = soup.new_tag('td')
            td.string = _cell_text(tc).strip()
            td_style = ''

            # Background color
            shd = _tc_prop(tc, 'w:shd')
            if shd is not None:
                bg_color = shd.get(qn('w:fill'))
                if bg_color and bg_color != 'auto':
                    td_style += f'background-color:#{bg_color};'

            # Alignment is stored on the paragraph (w:pPr/w:jc), not on the cell
            jc = tc.find(f"{qn('w:p')}/{qn('w:pPr')}/{qn('w:jc')}")
            if jc is not None:
                align = jc.get(qn('w:val'))
                if align in ('center', 'right'):
                    td_style += f'text-align:{align};'
                else:
                    td_style += 'text-align:left;'

            # Borders
            borders = _tc_prop(tc, 'w:tcBorders')
            if borders is not None:
                for border_type in ['top', 'left', 'bottom', 'right']:
                    border = borders.find(qn(f'w:{border_type}'))
                    if border is not None:
                        color = border.get(qn('w:color'), '000000')
                        if color == 'auto':
                            color = '000000'
                        size = int(border.get(qn('w:sz'), '4'))  # eighths of a point
                        style = border_styles.get(border.get(qn('w:val'), 'single'), 'solid')
                        td_style += f'border-{border_type}:{size / 8:g}pt {style} #{color};'

            # Horizontal merge (colspan)
            colspan = _grid_span(tc)
            if colspan > 1:
                td['colspan'] = str(colspan)

            # Vertical merge (rowspan): count the 'continue' cells directly below
            if merge is not None:
                rowspan = 1
                for below in grid[row_idx + 1:]:
                    next_tc = below.get(col_idx)
                    if next_tc is None or _v_merge(next_tc) != 'continue':
                        break
                    rowspan += 1
                if rowspan > 1:
                    td['rowspan'] = str(rowspan)

            # Apply the collected style plus default padding
            td['style'] = td_style + "padding: 5px;"
            html_tr.append(td)
        html_table.append(html_tr)
    soup.append(html_table)
    return str(soup)


def set_cell_background(cell, color_hex):
    """Set a cell's background color from a 6-digit hex string such as '#FFCC00'."""
    color_hex = color_hex.lstrip('#')
    shading_elm = OxmlElement('w:shd')
    shading_elm.set(qn('w:val'), 'clear')
    shading_elm.set(qn('w:fill'), color_hex)
    cell._tc.get_or_add_tcPr().append(shading_elm)


def html_table_to_docx(doc, html_content):
    """Convert every table in an HTML string into a table in a Word document.

    :param doc: python-docx Document instance
    :param html_content: HTML string
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    align_map = {
        'left': WD_ALIGN_PARAGRAPH.LEFT,
        'center': WD_ALIGN_PARAGRAPH.CENTER,
        'right': WD_ALIGN_PARAGRAPH.RIGHT,
    }
    for html_table in soup.find_all('table'):
        # Number of rows
        trs = html_table.find_all('tr')
        rows = len(trs)
        # Estimate the maximum number of columns, counting colspan
        cols = 0
        for tr in trs:
            col_count = sum(int(cell.get('colspan', 1)) for cell in tr.find_all(['td', 'th']))
            cols = max(cols, col_count)
        if rows == 0 or cols == 0:
            continue
        # Create the Word table
        table = doc.add_table(rows=rows, cols=cols)
        table.style = 'Table Grid'
        # Track grid positions already covered by a merge
        used_cells = [[False for _ in range(cols)] for _ in range(rows)]
        for row_idx, tr in enumerate(trs):
            col_idx = 0
            for cell in tr.find_all(['td', 'th']):
                while col_idx < cols and used_cells[row_idx][col_idx]:
                    col_idx += 1
                if col_idx >= cols:
                    break  # avoid indexing past the last column
                colspan = int(cell.get('colspan', 1))
                rowspan = int(cell.get('rowspan', 1))
                text = cell.get_text(strip=True)

                # Parse the inline CSS declarations
                css = {}
                for decl in cell.get('style', '').split(';'):
                    if ':' in decl:
                        name, value = decl.split(':', 1)
                        css[name.strip().lower()] = value.strip()
                # Alignment: the align attribute, else the text-align declaration
                align = cell.get('align') or css.get('text-align')
                bg_color = css.get('background-color') or css.get('background')

                word_cell = table.cell(row_idx, col_idx)
                # Merge cells
                if colspan > 1 or rowspan > 1:
                    end_row = min(row_idx + rowspan - 1, rows - 1)
                    end_col = min(col_idx + colspan - 1, cols - 1)
                    word_cell = word_cell.merge(table.cell(end_row, end_col))

                para = word_cell.paragraphs[0]
                para.text = text
                if align in align_map:
                    para.alignment = align_map[align]
                # Only 6-digit hex colors are applied; other color formats are skipped
                if bg_color and re.fullmatch(r'#?[0-9a-fA-F]{6}', bg_color):
                    set_cell_background(word_cell, bg_color)

                # Mark the covered grid positions as used
                for r in range(row_idx, min(row_idx + rowspan, rows)):
                    for c in range(col_idx, min(col_idx + colspan, cols)):
                        used_cells[r][c] = True
                # Move to the next free column
                col_idx += colspan
        # Add an empty paragraph as a separator
        doc.add_paragraph()
    return doc


def copy_inline_shapes(old_paragraph):
    """Extract all inline images from a paragraph as [bytes, width, height] entries."""
    images = []
    for shape in old_paragraph._element.xpath('.//w:drawing'):
        blip = shape.find('.//a:blip', namespaces={
            'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'})
        if blip is not None:
            rId = blip.attrib['{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed']
            image_part = old_paragraph.part.related_parts[rId]
            image_bytes = image_part.image.blob
            # width/height are the image's native size, not its displayed size
            images.append([image_bytes, image_part.image.width, image_part.image.height])
    return images


def is_page_break(element):
    """Return True if a paragraph contains a page break, or a table is directly followed by one."""
    def has_page_break(p):
        # w:br sits inside a run (w:r), so search descendants, not direct children
        return any(br.get(qn('w:type')) == 'page' for br in p.iter(qn('w:br')))

    if element.tag == qn('w:p'):
        return has_page_break(element)
    if element.tag == qn('w:tbl'):
        next_element = element.getnext()
        return (next_element is not None
                and next_element.tag == qn('w:p')
                and has_page_break(next_element))
    return False


def clone_paragraph(old_para):
    """Collect style data and run texts from a paragraph.

    Returns (style, paras): style holds the paragraph style, alignment, runs,
    and any images; paras maps each run's text to the style key.
    """
    # The key combines the style name and alignment; the style is kept
    # mainly so the heading level can be identified later
    style_key = old_para.style.name + "_" + str(old_para.alignment).split()[0]
    style = {
        "run_style": [],
        "style": {style_key: old_para.style},
        "alignment": {style_key: old_para.alignment},
    }
    paras = []
    for old_run in old_para.runs:
        style["run_style"].append(old_run)
        paras.append({old_run.text: style_key})
    images = copy_inline_shapes(old_para)
    if images:
        style["image"] = images
        paras.append({"Image_None": "Image_None"})
    return style, paras


def clone_document(old_doc_path):
    """Split a document into style records and an ordered list of content items."""
    try:
        old_doc = Document(old_doc_path)
        body_style = []
        body_paras = []
        # Walk the body in order so paragraphs and tables keep their positions
        for element in old_doc.element.body:
            if element.tag == qn('w:p'):
                style, paras = clone_paragraph(Paragraph(element, old_doc))
                body_style.append(style)
                body_paras += paras
                # Record a page break found in this paragraph; a break after a
                # table is picked up here too, when the next paragraph is visited
                if is_page_break(element):
                    body_paras.append("br")
            elif element.tag == qn('w:tbl'):
                body_paras.append({"table": docx_table_to_html(Table(element, old_doc))})
        return body_style, body_paras
    except Exception as e:
        print(f"Error while copying the document: {e}")
        raise


# Usage example
if __name__ == "__main__":
    # Extract style records and content items from a source document
    body_s, body_p = clone_document('专报1.docx')

Generating an Editable Template

from docx import Document
from docx.enum.style import WD_STYLE_TYPE
from docx.enum.text import WD_ALIGN_PARAGRAPH

# Create a new Word document
doc = Document()

# Collect all available paragraph styles (paragraph styles only)
paragraph_styles = [style for style in doc.styles if style.type == WD_STYLE_TYPE.PARAGRAPH]

# Print the number of styles found
print(f"Found {len(paragraph_styles)} paragraph styles:")
for style in paragraph_styles:
    print(f"- {style.name}")

for align in [WD_ALIGN_PARAGRAPH.LEFT, WD_ALIGN_PARAGRAPH.RIGHT, WD_ALIGN_PARAGRAPH.CENTER, None]:
    for bold_flag in [True, False]:
        # Add a sample paragraph for every style
        for style in paragraph_styles:
            heading = doc.add_paragraph()
            run = heading.add_run(f"Style name: {style.name}")
            run.bold = bold_flag
            para = doc.add_paragraph(f"This is a sample paragraph using the '{style.name}' style.", style=style)
            para.alignment = align
            # Add a separator line (optional)
            doc.add_paragraph("-" * 40)

# Save as demo_template.docx
doc.save("demo_template.docx")
print("\n✅ Generated a template file containing all paragraph styles: demo_template.docx")