当前位置: 首页 > news >正文

calamine读取xlsx文件的方法比较

calamine是纯rust编写的电子表格文件读取库,支持Excel(xls, xlsx, xlsm, xlsb, xla, xlam)、opendocument电子表格(ods)。
它自带的示例程序excel_to_csv.rs采用workbook.worksheet_range来读取,很多AI和网上的介绍文章都说,处理大文件workbook.worksheet_range_ref更快,我实际测试了一下。

测试步骤
0.准备测试xlsx文件
在duckdb中生成tpch数据,用excel插件写入xlsx文件。

duckdb
load tpch;
load excel;
call dbgen(sf=1);
copy (from lineitem limit 1048575)to 'exli2.xlsx' WITH (FORMAT xlsx, HEADER true) ;

1.拉取calamine源码

git clone --depth=1 https://gitclone.com/github.com/tafia/calamine
Cloning into 'calamine'...
remote: 对象计数中: 133, 完成.
remote: 压缩对象中: 100% (109/109), 完成.
remote: Total 133 (delta 21), reused 133 (delta 21)
Receiving objects: 100% (133/133), 5.08 MiB | 48.00 KiB/s, done.
Resolving deltas: 100% (21/21), done.

2.编译执行

export PATH=/par/mold240/bin:$PATHcd /par/calamine
cargo build --release --example excel_to_csvtime /par/calamine/target/release/examples/excel_to_csv /par/exli2.xlsx Sheet1real	0m21.959s
user	0m19.332s
sys	0m2.180s

3.修改excel_to_csv.rs为excel_to_csv_ref.rs
基本上就是把函数worksheet_range改为worksheet_range_ref,加入use calamine::ReaderRef;和use calamine::DataRef;数据类型Data修改为DataRef<'>,再根据编译器提示加入DataRef::<'>::SharedString的处理。

// SPDX-License-Identifier: MIT
//
// Copyright 2016-2025, Johann Tuffe.//! An example for using the `calamine` crate to convert an Excel file to CSV.
//!
//! Converts XLSX, XLSM, XLSB, and XLS files. The filename and sheet name must
//! be specified as command line arguments. The output CSV will be written to a
//! file with the same name as the input file, but with a `.csv` extension.use std::env;
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::PathBuf;use calamine::{open_workbook_auto, Data, Range, Reader};
use calamine::ReaderRef;
use calamine::DataRef;
// usage: cargo run --example excel_to_csv file.xls[xmb] sheet_name
//
// Where:
// - `file.xls[xmb]` is the Excel file to convert. Required.
// - `sheet_name` is the name of the sheet to convert. Required.
//
// The output will be written to a file with the same name as the input file,
// including the path, but with a `.csv` extension.
//
fn main() {let excel_file = env::args().nth(1).expect("Please provide an excel file to convert");let sheet_name = env::args().nth(2).expect("Expecting a sheet name as second argument");let excel_path = PathBuf::from(excel_file);match excel_path.extension().and_then(|s| s.to_str()) {Some("xlsx") | Some("xlsm") | Some("xlsb") | Some("xls") => (),_ => panic!("Expecting an excel file"),}let csv_path = excel_path.with_extension("csv");let mut csv_file = BufWriter::new(File::create(csv_path).unwrap());let mut workbook = open_workbook_auto(&excel_path).unwrap();let range = workbook.worksheet_range_ref(&sheet_name).unwrap();write_to_csv(&mut csv_file, &range).unwrap();
}// Write the Excel data as strings to a CSV file. Uses a semicolon (`;`) as the
// field separator.
//
// Note, this is a simplified version of CSV and doesn't handle quoting of
// separators or other special cases. See the `csv.rs` crate for a more robust
// solution.
fn write_to_csv<W: Write>(output_file: &mut W, range: &Range<DataRef<'_>>) -> std::io::Result<()> {let max_column = range.get_size().1 - 1;for rows in range.rows() {for (col_number, cell_data) in rows.iter().enumerate() {match *cell_data {DataRef::<'_>::Empty => Ok(()),DataRef::<'_>::Int(ref i) => write!(output_file, "{i}"),DataRef::<'_>::Bool(ref b) => write!(output_file, "{b}"),DataRef::<'_>::Error(ref e) => write!(output_file, "{e:?}"),DataRef::<'_>::Float(ref f) => write!(output_file, "{f}"),DataRef::<'_>::DateTime(ref d) => write!(output_file, "{}", d.as_f64()),DataRef::<'_>::String(ref s) | DataRef::<'_>::DateTimeIso(ref s) | DataRef::<'_>::DurationIso(ref s) => {write!(output_file, "{s}")} , DataRef::<'_>::SharedString(s) => write!(output_file, "{s}")}?;// Write the field separator except for the last column.if col_number != max_column {write!(output_file, ";")?;}}write!(output_file, "\r\n")?;}Ok(())
}

4.编译执行

cargo build --release --example excel_to_csv_refmv /par/exli2.csv /par/exli2.csv.old
time /par/calamine/target/release/examples/excel_to_csv_ref /par/exli2.xlsx Sheet1real	0m22.062s
user	0m19.412s
sys	0m2.592s
md5sum /par/exli2.csv*
05bcc52a09f1ee2bbb7c9b08172fb7ae  /par/exli2.csv
05bcc52a09f1ee2bbb7c9b08172fb7ae  /par/exli2.csv.old

结论:worksheet_range和worksheet_range_ref读取速度和读取结果完全一样。
polars和pandas的read_excel函数都支持使用calamine引擎,polars目前还是默认使用calamine引擎,测试结果如下

import polars as plpl.read_excel(source="exli2.xlsx",sheet_name="Sheet1",) 
shape: (1_048_575, 16)
┌────────────┬───────────┬───────────┬──────────────┬───┬───────────────┬───────────────────┬────────────┬─────────────────────────────────┐
│ l_orderkey ┆ l_partkey ┆ l_suppkey ┆ l_linenumber ┆ … ┆ l_receiptdate ┆ l_shipinstruct    ┆ l_shipmode ┆ l_comment                       │
│ ---        ┆ ---       ┆ ---       ┆ ---          ┆   ┆ ---           ┆ ---               ┆ ---        ┆ ---                             │
│ i64        ┆ i64       ┆ i64       ┆ i64          ┆   ┆ date          ┆ str               ┆ str        ┆ str                             │
╞════════════╪═══════════╪═══════════╪══════════════╪═══╪═══════════════╪═══════════════════╪════════════╪═════════════════════════════════╡
│ 115519077061            ┆ … ┆ 1996-03-22    ┆ DELIVER IN PERSON ┆ TRUCK      ┆ to beans x-ray carefull         │
│ 16731073112            ┆ … ┆ 1996-04-20    ┆ TAKE BACK RETURN  ┆ MAIL       ┆  according to the final foxes.… │
│ 16370037013            ┆ … ┆ 1996-01-31    ┆ TAKE BACK RETURN  ┆ REG AIR    ┆ ourts cajole above the furiou   │
│ 1213246334            ┆ … ┆ 1996-05-16    ┆ NONE              ┆ AIR        ┆ s cajole busily above t         │
│ 12402715345            ┆ … ┆ 1996-04-01    ┆ NONE              ┆ FOB        ┆  the regular, regular pa        │
│ …          ┆ …         ┆ …         ┆ …            ┆ … ┆ …             ┆ …                 ┆ …          ┆ …                               │
│ 10484826322482375            ┆ … ┆ 1992-05-17    ┆ NONE              ┆ MAIL       ┆  final accounts haggle about t… │
│ 10484838136013611            ┆ … ┆ 1994-08-30    ┆ NONE              ┆ AIR        ┆ ely around the regula           │
│ 10484831005250552            ┆ … ┆ 1994-07-08    ┆ DELIVER IN PERSON ┆ MAIL       ┆ of the bli                      │
│ 104848312950995103            ┆ … ┆ 1994-05-16    ┆ TAKE BACK RETURN  ┆ RAIL       ┆ ests impress. sly, even i       │
│ 10484835007025764            ┆ … ┆ 1994-08-29    ┆ DELIVER IN PERSON ┆ REG AIR    ┆ regularly expre                 │
└────────────┴───────────┴───────────┴──────────────┴───┴───────────────┴───────────────────┴────────────┴─────────────────────────────────┘
import time
t=time.time();x=pl.read_excel(source="exli2.xlsx",sheet_name="Sheet1",);print(time.time()-t)
23.807374477386475
t=time.time();x.write_csv("plexli2.csv", separator=",");print(time.time()-t)
0.6844663619995117
t=time.time();pl.read_excel(source="exli2.xlsx",sheet_name="Sheet1",).write_csv("plexli.csv", separator=",");print(time.time()-t)
22.80883002281189import pandas as pd
import time
t=time.time();pd.read_excel("exli2.xlsx",engine="calamine").to_csv("pdexl2.csv", index=False, header=True);print(time.time()-t)
84.7169542312622md5sum p*exl2.csv
814e86b96b4e15ae43bf005ca0f5c8fe  pdexl2.csv
md5sum plexli2.csv
814e86b96b4e15ae43bf005ca0f5c8fe  plexli2.csv

虽然通过python中转,polars与rust中直接转的时间差距微小,而pandas则慢了很多。

http://www.xdnf.cn/news/1279063.html

相关文章:

  • 华为虚拟防火墙配置案例详解
  • 未来物联网大模型:物联网硬件+底层驱动+AI 自动生成和调优LUA脚本,
  • 数据备份与进程管理
  • TikTok登录时显示“访问太频繁,请稍后再试”该怎么办?
  • 【数据库】如何使用一款轻量级数据库SqlSugar进行批量更新,以及查看最终的Sql操作语句
  • 【车联网kafka】Kafka核心架构与实战经验(第三篇)
  • Python FastAPI + React + Nginx 阿里云WINDOWS ECS部署实战:从标准流程到踩坑解决全记录
  • 2025面试题——(12)
  • docker部署elasticsearch-8.11.1
  • <typeAliases>
  • 李宏毅2025《机器学习》-第十讲:AI“思想钢印”:深入解析大模型的知识编辑技术
  • k8s 命令记录
  • 揭开内容分发网络(CDN)的神秘面纱:互联网的隐形加速器
  • 【代码随想录day 17】 力扣 617.合并二叉树
  • 41.【.NET8 实战--孢子记账--从单体到微服务--转向微服务】--扩展功能--集成网关--网关集成Swagger
  • ZeroNews:如何构建安全(无需 V*N!)的工业物联网连接
  • 代币化股票的崛起:比特币安全吗?
  • “一车一码一池一充”:GB 17761-2024新国标下电动自行车的安全革命
  • Mac配置服务器工具Royal TSX
  • 套接字技术、视频加载技术、断点续传技术
  • AI炼丹日志-32- memvid 大模型数据库!用视频存储+语义检索实现秒级搜索
  • 《 AudioClassification-Pytorch:GitHub项目网页解读》
  • GaussDB 数据库架构师修炼(十三)安全管理(1)-账号的管理
  • OmniHuman:字节推出的AI项目,支持单张照片生成逼真全身动态视频
  • STM32 HAL驱动MPU6050传感器
  • TradingAgents-CN: 基于多智能体的中文金融交易决策框架
  • 基于深度学习的异常检测算法在时间序列数据中的应用
  • 【深度学习3】向量化(Vectorization)
  • 广州建筑矢量数据分享,带高度,矢量文件
  • Day118 Spring Security