Writing xlsb and xlsx files with the pyxlsbwriter package, and comparing read efficiency
calamine can read the xlsb format, but examples of saving large files as xlsb are rare. pyxlsbwriter is a newly published package on pypi.org that can write both xlsb and xlsx files.
So I had deepseek rewrite its example to accept a DataFrame, which bridges it to duckdb: a duckdb query result can then be written straight to a file.
from pyxlsbwriter import XlsbWriter
import pandas as pd
from datetime import datetime
import duckdb

def df_to_xlsb(df, output_file, sheet_name="Sheet1", hidden=False, compression_level=6):
    """Write a pandas DataFrame to an XLSB file.

    Args:
        df (pd.DataFrame): DataFrame to write
        output_file (str): output file name
        sheet_name (str): worksheet name, defaults to "Sheet1"
        hidden (bool): whether to hide the worksheet, defaults to False
        compression_level (int): compression level (0-9), defaults to 6
    """
    # Convert the DataFrame to a list of rows
    data = [df.columns.tolist()]      # header row from the column names
    data.extend(df.values.tolist())   # data rows

    # Write the XLSB file
    with XlsbWriter(output_file, compressionLevel=compression_level) as writer:
        writer.add_sheet(sheet_name, hidden=hidden)
        writer.write_sheet(data)

# Example usage
if __name__ == "__main__":
    sss = """select "Unique Key","Created Date", "Closed Date", "Latitude", "Longitude" from 'NYC_311_SR_2010-2020-sample-1M.csv'"""
    ss2 = """select * from '5m Sales Records.csv'"""
    df = duckdb.sql(ss2).df()
    print(df)

    # Write the XLSB file
    df_to_xlsb(
        df=df,
        output_file="duckdbdf5.xlsb",
        sheet_name="Employees",
        hidden=False,
        compression_level=6,
    )
    print("DataFrame successfully written to output.xlsb")
To write an xlsx file instead, change

# Write the XLSB file
with XlsbWriter(output_file, compressionLevel=compression_level) as writer:

to

# Write the XLSX file
with XlsxWriter(output_file, compressionLevel=compression_level) as writer:

and add from pyxlsbwriter import XlsxWriter at the top of the program. Everything else stays exactly the same, which is very convenient.
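For completeness, here is a minimal sketch of the xlsx variant with those two changes applied (the interface is otherwise identical to df_to_xlsb above):

from pyxlsbwriter import XlsxWriter

def df_to_xlsx(df, output_file, sheet_name="Sheet1", hidden=False, compression_level=6):
    """Same as df_to_xlsb above, but writing an XLSX file."""
    data = [df.columns.tolist()]      # header row from the column names
    data.extend(df.values.tolist())   # data rows

    # Write the XLSX file
    with XlsxWriter(output_file, compressionLevel=compression_level) as writer:
        writer.add_sheet(sheet_name, hidden=hidden)
        writer.write_sheet(data)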
Below are the results for writing 1 million rows and 5 million rows, respectively.
[1000000 rows x 5 columns]
DataFrame successfully written to duckdbdf.xlsb
Kernel Time  =     0.468 =   4%
User Time    =    10.781 = 104%
Process Time =    11.250 = 108%    Virtual Memory  =  1076 MB
Global Time  =    10.330 = 100%    Physical Memory =   624 MB

[1000000 rows x 5 columns]
DataFrame successfully written to duckdbdf.xlsx
Kernel Time  =     0.500 =   4%
User Time    =    11.703 = 104%
Process Time =    12.203 = 108%    Virtual Memory  =  1559 MB
Global Time  =    11.243 = 100%    Physical Memory =  1033 MB

[5000000 rows x 14 columns]
DataFrame successfully written to duckdbdf5.xlsx
Kernel Time  =     5.546 =   6%
User Time    =    85.000 =  97%
Process Time =    90.546 = 103%    Virtual Memory  = 10870 MB
Global Time  =    87.103 = 100%    Physical Memory =  7467 MB

[5000000 rows x 14 columns]
DataFrame successfully written to duckdbdf5.xlsb
Kernel Time  =     1.828 =   2%
User Time    =    81.046 = 102%
Process Time =    82.875 = 104%    Virtual Memory  =  5878 MB
Global Time  =    79.040 = 100%    Physical Memory =  5415 MB
The times are close, with xlsb slightly faster, but writing xlsb uses noticeably less memory. Checking the file sizes, surprisingly the xlsx file is smaller, and for both datasets, which contradicts what is commonly said online.
2025/08/17  11:36    225,840,081 duckdbdf5.xlsx
2025/08/17  11:38    233,894,167 duckdbdf5.xlsb
2025/08/17  11:25     28,704,798 duckdbdf.xlsx
2025/08/17  11:20     32,590,397 duckdbdf.xlsb
In fact, an xlsb file's outer shell is also a zip archive; the difference is that the .xml parts inside become .bin parts. Perhaps zip compresses text well but compresses this binary encoding less effectively.
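This is easy to verify with Python's standard zipfile module (a quick check of my own, not part of the original script): listing the archive members shows .bin parts in the xlsb file where the xlsx file has .xml parts, together with each entry's compression ratio:

import zipfile

# Inspect the zip shells of both containers: xlsb stores .bin parts
# where xlsx stores .xml parts.
for path in ("duckdbdf.xlsx", "duckdbdf.xlsb"):
    with zipfile.ZipFile(path) as z:
        print(path)
        for info in z.infolist():
            ratio = info.compress_size / info.file_size if info.file_size else 0
            print(f"  {info.filename}: {info.file_size:,} -> "
                  f"{info.compress_size:,} bytes ({ratio:.0%})")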
Next, compare how rusty_sheet reads the two formats. The official excel extension could not be tested because it does not support the xlsb format.
D create table xlsx as from read_sheet('duckdbdf5.xlsx',header=1);
Run Time (s): real 70.917 user 61.187500 sys 8.734375
D create table xlsb as from read_sheet('duckdbdf5.xlsb',header=1);
Run Time (s): real 7.639 user 7.312500 sys 0.281250
That looks excellent: reading xlsb took only about a tenth of the time needed for xlsx. But counting rows, xlsb came up far short. Is calamine dropping data when it reads the xlsb file, or does the xlsb spec only allow that many rows? Opening the files in WPS, both extensions show exactly 1,048,576 rows, and Microsoft's documentation gives 1,048,576 rows as the per-sheet maximum for the xlsx format. So why can calamine read 5 million rows out of the xlsx file? That needs further investigation.
D select count(*) from xlsx;
┌────────────────┐
│ count_star() │
│ int64 │
├────────────────┤
│ 5000000 │
│ (5.00 million) │
└────────────────┘
Run Time (s): real 0.007 user 0.000000 sys 0.000000
D select count(*) from xlsb;
┌────────────────┐
│ count_star() │
│ int64 │
├────────────────┤
│ 1048576 │
│ (1.05 million) │
└────────────────┘
Run Time (s): real 0.007 user 0.000000 sys 0.000000
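One way to probe this (my own sketch, not from the original test run) is to stream the sheet XML out of the xlsx container and count the row elements directly: if the count comes to 5,000,001 (header included), the file simply contains more rows than the spec allows, calamine reads them all, and Excel-compatible tools stop at the limit. The part name xl/worksheets/sheet1.xml is the usual default and is an assumption here:

import zipfile
from xml.etree.ElementTree import iterparse

# Stream-parse the worksheet XML inside the xlsx container and count
# <row> elements without loading the whole sheet into memory.
rows = 0
with zipfile.ZipFile("duckdbdf5.xlsx") as z:
    with z.open("xl/worksheets/sheet1.xml") as f:   # assumed default part name
        for event, elem in iterparse(f):
            if elem.tag.endswith("}row"):
                rows += 1
                elem.clear()   # drop the row's cells to keep memory bounded
print(f"row elements in sheet1.xml: {rows:,}")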
The workaround is to add sheets, which lets you store more rows; then use read_sheet to target each sheet and union all the results. If the per-sheet speed ratio holds, this still works out faster overall than reading a single-sheet xlsx file.
Let's test it:
D create table xlsb as from read_sheet('duckdbdf5.xlsb',header=1,sheet_name='Sheet1')
union all from read_sheet('duckdbdf5.xlsb',header=1,sheet_name='Sheet2')
union all from read_sheet('duckdbdf5.xlsb',header=1,sheet_name='Sheet3')
union all from read_sheet('duckdbdf5.xlsb',header=1,sheet_name='Sheet4')
union all from read_sheet('duckdbdf5.xlsb',header=1,sheet_name='Sheet5');
100% ▕████████████████████████████████████████████████████████████▏
Run Time (s): real 33.616 user 31.484375 sys 2.000000
D select count(*) from xlsb;
┌────────────────┐
│ count_star() │
│ int64 │
├────────────────┤
│ 5000000 │
│ (5.00 million) │
└────────────────┘
Run Time (s): real 0.007 user 0.000000 sys 0.000000
The execution time matches the prediction, so the xlsb format really does improve read efficiency.
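With many sheets, the union is easier to generate than to type by hand. A minimal sketch from the Python side, assuming rusty_sheet installs as a DuckDB community extension and the sheet names follow the Sheet1..SheetN pattern used above:

import duckdb

con = duckdb.connect()
# Assumption: rusty_sheet is available from the community extension repository.
con.sql("INSTALL rusty_sheet FROM community")
con.sql("LOAD rusty_sheet")

path = "duckdbdf5.xlsb"
sheets = [f"Sheet{i}" for i in range(1, 6)]  # Sheet1..Sheet5, as written above

# Build one FROM-first SELECT per sheet and glue them together with UNION ALL.
query = " union all ".join(
    f"from read_sheet('{path}', header=1, sheet_name='{name}')"
    for name in sheets
)
con.sql(f"create table xlsb as {query}")
print(con.sql("select count(*) from xlsb"))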
Below is the deepseek code that writes an XLSB file split across multiple sheets:
from pyxlsbwriter import XlsbWriter
import pandas as pd
from datetime import datetime
import math
import duckdb

def df_to_xlsb(df, output_file, sheet_prefix="Sheet", hidden=False, compression_level=6,
               max_rows_per_sheet=1048575  # max data rows per XLSB sheet (1,048,576 minus the header row)
               ):
    """Write a large pandas DataFrame to an XLSB file, split across sheets.

    Args:
        df (pd.DataFrame): DataFrame to write
        output_file (str): output file name
        sheet_prefix (str): worksheet name prefix, defaults to "Sheet"
        hidden (bool): whether to hide the worksheets, defaults to False
        compression_level (int): compression level (0-9), defaults to 6
        max_rows_per_sheet (int): maximum data rows per sheet, defaults to 1,048,575
    """
    # Work out how many sheets are needed
    total_rows = len(df)
    num_sheets = math.ceil(total_rows / max_rows_per_sheet)

    # Generate the sheet names
    sheet_names = [f"{sheet_prefix}{i+1}" for i in range(num_sheets)]

    with XlsbWriter(output_file, compressionLevel=compression_level) as writer:
        for i, sheet_name in enumerate(sheet_names):
            # Row range covered by the current sheet
            start_row = i * max_rows_per_sheet
            end_row = min((i + 1) * max_rows_per_sheet, total_rows)

            # Slice out the current sheet's chunk
            chunk = df.iloc[start_row:end_row]

            # Data to write (header + rows)
            data = [chunk.columns.tolist()]      # header row from the column names
            data.extend(chunk.values.tolist())   # data rows

            # Create and fill the sheet
            writer.add_sheet(sheet_name, hidden=hidden)
            writer.write_sheet(data)
            print(f"Wrote sheet: {sheet_name}, rows: {len(chunk)}")

# Example usage
if __name__ == "__main__":
    ss2 = """select * from '5m Sales Records.csv'"""
    df = duckdb.sql(ss2).df()

    # Write the XLSB file (split across sheets automatically)
    df_to_xlsb(
        df=df,
        output_file="large_data.xlsb",
        sheet_prefix="Sheet",
        hidden=False,
        compression_level=6,
    )
    print("Large DataFrame successfully written to XLSB across multiple sheets")
Its execution time is essentially the same as the single-sheet version:
timer64 python xlsbsheets.py
Wrote sheet: Sheet1, rows: 1048575
Wrote sheet: Sheet2, rows: 1048575
Wrote sheet: Sheet3, rows: 1048575
Wrote sheet: Sheet4, rows: 1048575
Wrote sheet: Sheet5, rows: 805700
Large DataFrame successfully written to XLSB across multiple sheets
Kernel Time  =     2.218 =   2%
User Time    =    79.515 = 102%
Process Time =    81.734 = 105%    Virtual Memory  =  5662 MB
Global Time  =    77.788 = 100%    Physical Memory =  5198 MB