calamine读取xlsx文件的方法比较
calamine是纯rust编写的电子表格文件读取库,支持Excel(xls, xlsx, xlsm, xlsb, xla, xlam)、opendocument电子表格(ods)。
它自带的示例程序excel_to_csv.rs采用workbook.worksheet_range来读取,很多AI和网上的介绍文章都说,处理大文件workbook.worksheet_range_ref更快,我实际测试了一下。
测试步骤
0.准备测试xlsx文件
在duckdb中生成tpch数据,用excel插件写入xlsx文件。
duckdb
load tpch;
load excel;
call dbgen(sf=1);
copy (from lineitem limit 1048575)to 'exli2.xlsx' WITH (FORMAT xlsx, HEADER true) ;
1.拉取calamine源码
git clone --depth=1 https://gitclone.com/github.com/tafia/calamine
Cloning into 'calamine'...
remote: 对象计数中: 133, 完成.
remote: 压缩对象中: 100% (109/109), 完成.
remote: Total 133 (delta 21), reused 133 (delta 21)
Receiving objects: 100% (133/133), 5.08 MiB | 48.00 KiB/s, done.
Resolving deltas: 100% (21/21), done.
2.编译执行
export PATH=/par/mold240/bin:$PATHcd /par/calamine
cargo build --release --example excel_to_csvtime /par/calamine/target/release/examples/excel_to_csv /par/exli2.xlsx Sheet1real 0m21.959s
user 0m19.332s
sys 0m2.180s
3.修改excel_to_csv.rs为excel_to_csv_ref.rs
基本上就是把函数worksheet_range改为worksheet_range_ref,加入use calamine::ReaderRef;和use calamine::DataRef;数据类型Data修改为DataRef<'>,再根据编译器提示加入DataRef::<'>::SharedString的处理。
// SPDX-License-Identifier: MIT
//
// Copyright 2016-2025, Johann Tuffe.//! An example for using the `calamine` crate to convert an Excel file to CSV.
//!
//! Converts XLSX, XLSM, XLSB, and XLS files. The filename and sheet name must
//! be specified as command line arguments. The output CSV will be written to a
//! file with the same name as the input file, but with a `.csv` extension.use std::env;
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::PathBuf;use calamine::{open_workbook_auto, Data, Range, Reader};
use calamine::ReaderRef;
use calamine::DataRef;
// usage: cargo run --example excel_to_csv file.xls[xmb] sheet_name
//
// Where:
// - `file.xls[xmb]` is the Excel file to convert. Required.
// - `sheet_name` is the name of the sheet to convert. Required.
//
// The output will be written to a file with the same name as the input file,
// including the path, but with a `.csv` extension.
//
fn main() {let excel_file = env::args().nth(1).expect("Please provide an excel file to convert");let sheet_name = env::args().nth(2).expect("Expecting a sheet name as second argument");let excel_path = PathBuf::from(excel_file);match excel_path.extension().and_then(|s| s.to_str()) {Some("xlsx") | Some("xlsm") | Some("xlsb") | Some("xls") => (),_ => panic!("Expecting an excel file"),}let csv_path = excel_path.with_extension("csv");let mut csv_file = BufWriter::new(File::create(csv_path).unwrap());let mut workbook = open_workbook_auto(&excel_path).unwrap();let range = workbook.worksheet_range_ref(&sheet_name).unwrap();write_to_csv(&mut csv_file, &range).unwrap();
}// Write the Excel data as strings to a CSV file. Uses a semicolon (`;`) as the
// field separator.
//
// Note, this is a simplified version of CSV and doesn't handle quoting of
// separators or other special cases. See the `csv.rs` crate for a more robust
// solution.
fn write_to_csv<W: Write>(output_file: &mut W, range: &Range<DataRef<'_>>) -> std::io::Result<()> {let max_column = range.get_size().1 - 1;for rows in range.rows() {for (col_number, cell_data) in rows.iter().enumerate() {match *cell_data {DataRef::<'_>::Empty => Ok(()),DataRef::<'_>::Int(ref i) => write!(output_file, "{i}"),DataRef::<'_>::Bool(ref b) => write!(output_file, "{b}"),DataRef::<'_>::Error(ref e) => write!(output_file, "{e:?}"),DataRef::<'_>::Float(ref f) => write!(output_file, "{f}"),DataRef::<'_>::DateTime(ref d) => write!(output_file, "{}", d.as_f64()),DataRef::<'_>::String(ref s) | DataRef::<'_>::DateTimeIso(ref s) | DataRef::<'_>::DurationIso(ref s) => {write!(output_file, "{s}")} , DataRef::<'_>::SharedString(s) => write!(output_file, "{s}")}?;// Write the field separator except for the last column.if col_number != max_column {write!(output_file, ";")?;}}write!(output_file, "\r\n")?;}Ok(())
}
4.编译执行
cargo build --release --example excel_to_csv_refmv /par/exli2.csv /par/exli2.csv.old
time /par/calamine/target/release/examples/excel_to_csv_ref /par/exli2.xlsx Sheet1real 0m22.062s
user 0m19.412s
sys 0m2.592s
md5sum /par/exli2.csv*
05bcc52a09f1ee2bbb7c9b08172fb7ae /par/exli2.csv
05bcc52a09f1ee2bbb7c9b08172fb7ae /par/exli2.csv.old
结论:worksheet_range和worksheet_range_ref读取速度和读取结果完全一样。
polars和pandas的read_excel函数都支持使用calamine引擎,polars目前还是默认使用calamine引擎,测试结果如下
import polars as plpl.read_excel(source="exli2.xlsx",sheet_name="Sheet1",)
shape: (1_048_575, 16)
┌────────────┬───────────┬───────────┬──────────────┬───┬───────────────┬───────────────────┬────────────┬─────────────────────────────────┐
│ l_orderkey ┆ l_partkey ┆ l_suppkey ┆ l_linenumber ┆ … ┆ l_receiptdate ┆ l_shipinstruct ┆ l_shipmode ┆ l_comment │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ date ┆ str ┆ str ┆ str │
╞════════════╪═══════════╪═══════════╪══════════════╪═══╪═══════════════╪═══════════════════╪════════════╪═════════════════════════════════╡
│ 1 ┆ 155190 ┆ 7706 ┆ 1 ┆ … ┆ 1996-03-22 ┆ DELIVER IN PERSON ┆ TRUCK ┆ to beans x-ray carefull │
│ 1 ┆ 67310 ┆ 7311 ┆ 2 ┆ … ┆ 1996-04-20 ┆ TAKE BACK RETURN ┆ MAIL ┆ according to the final foxes.… │
│ 1 ┆ 63700 ┆ 3701 ┆ 3 ┆ … ┆ 1996-01-31 ┆ TAKE BACK RETURN ┆ REG AIR ┆ ourts cajole above the furiou │
│ 1 ┆ 2132 ┆ 4633 ┆ 4 ┆ … ┆ 1996-05-16 ┆ NONE ┆ AIR ┆ s cajole busily above t │
│ 1 ┆ 24027 ┆ 1534 ┆ 5 ┆ … ┆ 1996-04-01 ┆ NONE ┆ FOB ┆ the regular, regular pa │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 1048482 ┆ 63224 ┆ 8237 ┆ 5 ┆ … ┆ 1992-05-17 ┆ NONE ┆ MAIL ┆ final accounts haggle about t… │
│ 1048483 ┆ 81360 ┆ 1361 ┆ 1 ┆ … ┆ 1994-08-30 ┆ NONE ┆ AIR ┆ ely around the regula │
│ 1048483 ┆ 10052 ┆ 5055 ┆ 2 ┆ … ┆ 1994-07-08 ┆ DELIVER IN PERSON ┆ MAIL ┆ of the bli │
│ 1048483 ┆ 129509 ┆ 9510 ┆ 3 ┆ … ┆ 1994-05-16 ┆ TAKE BACK RETURN ┆ RAIL ┆ ests impress. sly, even i │
│ 1048483 ┆ 50070 ┆ 2576 ┆ 4 ┆ … ┆ 1994-08-29 ┆ DELIVER IN PERSON ┆ REG AIR ┆ regularly expre │
└────────────┴───────────┴───────────┴──────────────┴───┴───────────────┴───────────────────┴────────────┴─────────────────────────────────┘
import time
t=time.time();x=pl.read_excel(source="exli2.xlsx",sheet_name="Sheet1",);print(time.time()-t)
23.807374477386475
t=time.time();x.write_csv("plexli2.csv", separator=",");print(time.time()-t)
0.6844663619995117
t=time.time();pl.read_excel(source="exli2.xlsx",sheet_name="Sheet1",).write_csv("plexli.csv", separator=",");print(time.time()-t)
22.80883002281189import pandas as pd
import time
t=time.time();pd.read_excel("exli2.xlsx",engine="calamine").to_csv("pdexl2.csv", index=False, header=True);print(time.time()-t)
84.7169542312622md5sum p*exl2.csv
814e86b96b4e15ae43bf005ca0f5c8fe pdexl2.csv
md5sum plexli2.csv
814e86b96b4e15ae43bf005ca0f5c8fe plexli2.csv
虽然通过python中转,polars与rust中直接转的时间差距微小,而pandas则慢了很多。