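openpi π₀ Project Deployment and Run Logic (5): Model Fine-Tuning
To re-run fine-tuning with the same configuration as the open-source release, use the command below. The --overwrite flag overwrites any existing checkpoints: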
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi0_aloha_cable_sort --exp-name=5_10 --overwrite
The command logs training progress to the console and saves checkpoints to the checkpoints directory.
Setting XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 lets JAX use up to 90% of GPU memory (the default is 75%).
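The same variable can also be set from Python, as long as that happens before JAX initializes its GPU backend; setting it before the first `import jax` is the safe choice. The sketch below only sets the variable and does some arithmetic. `preallocated_bytes` is a hypothetical helper and the 24 GiB card size is an assumption for illustration.

```python
import os

# Set this before the first `import jax` so the GPU backend picks it up.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.9"


def preallocated_bytes(total_gpu_bytes: int, fraction: float) -> int:
    """Rough size of the pool JAX would preallocate (hypothetical helper)."""
    if fraction <= 0.0:
        raise ValueError("fraction must be positive")
    return int(total_gpu_bytes * fraction)


fraction = float(os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"])
total = 24 * 1024**3  # assumed 24 GiB card, for illustration only
print(preallocated_bytes(total, fraction) / 1024**3)  # GiB at 0.9
print(preallocated_bytes(total, 0.75) / 1024**3)      # GiB at the 0.75 default
```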
Here TrainConfig.name is pi0_aloha_cable_sort.
Fine-tuning consists of two main steps: compute the normalization statistics of the training data -> run fine-tuning.
Contents
1 Normalization statistics
1.1 Reloading normalization statistics
1.2 Provided Pre-training Normalization Statistics
1.3 Pi0 Model Action Space Definitions
1.4 Configuration example
2 Model fine-tuning
2.1 Command setup
2.2 TypeError: LeRobotDatasetMetadata.__init__() got an unexpected keyword argument 'local_files_only'
2.3 FileNotFoundError: [Errno 2] No such file or directory: '/home/yejiangchen/.cache/huggingface/lerobot/datasets/collection_coil/meta/info.json'
2.4 The video decoding library torchcodec fails to load its FFmpeg dependencies
2.5 KeyError: 'observation.images.cam_high'
2.6 ValueError: operands could not be broadcast together with shapes (14,) (18,) state = _joint_flip_mask() * state
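1 Normalization statistics
Following common practice, the model normalizes the proprioceptive state inputs and the action targets during policy training and inference. The normalization statistics are computed over the training data and stored alongside the model checkpoint.
For details on reloading normalization statistics, see norm_stats.md.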
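To make the idea concrete, here is a minimal sketch of computing per-dimension statistics and applying them. It assumes plain z-score normalization with a small epsilon and uses only the standard library; it is not openpi's implementation, and the function names and toy data are made up for illustration.

```python
import statistics

EPS = 1e-6  # guards against division by zero for constant dimensions


def compute_norm_stats(rows):
    """Per-dimension mean and population std over equal-length vectors."""
    columns = list(zip(*rows))
    return {
        "mean": [statistics.fmean(c) for c in columns],
        "std": [statistics.pstdev(c) for c in columns],
    }


def normalize(x, stats):
    """Map raw values to zero-mean, unit-scale values."""
    return [(v - m) / (s + EPS) for v, m, s in zip(x, stats["mean"], stats["std"])]


def unnormalize(z, stats):
    """Inverse of normalize, used to turn model outputs back into raw actions."""
    return [v * (s + EPS) + m for v, m, s in zip(z, stats["mean"], stats["std"])]


# Toy "training set" of 2-dimensional states.
states = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
stats = compute_norm_stats(states)
z = normalize(states[0], stats)
restored = unnormalize(z, stats)
print(stats["mean"], z, restored)
```

The statistics are computed once and saved, so the same numbers are used for training and for inference.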
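1.1 Reloading normalization statistics
When fine-tuning a model on a new dataset, you need to decide whether to (A) reuse existing normalization statistics or (B) recompute them from your new training data. Which option is better depends on how similar your robot and task are to the robot and task distribution in the pre-training dataset. The pre-trained normalization statistics available for each model are listed below.
If your target robot matches one of the robots covered by the pre-trained statistics, consider loading the corresponding statistics directly. With the same normalization statistics, the actions in your dataset look more "familiar" to the model, which usually helps performance. To load normalization statistics, add an AssetsConfig to your training config that points to the corresponding checkpoint directory and normalization statistics ID. For example, for the Trossen (i.e. ALOHA) robot in the pi0_base checkpoint: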
TrainConfig(
    ...
    data=LeRobotAlohaDataConfig(
        ...
        assets=AssetsConfig(
            assets_dir="s3://openpi-assets/checkpoints/pi0_base/assets",
            asset_id="trossen",
        ),
    ),
)
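For a complete training config that loads normalization statistics, see the pi0_aloha_pen_uncap config in the training config file.
Note: for the normalization statistics to load successfully, your robot and dataset must follow the same action space definition that was used in pre-training.
Note #2: whether loading the normalization statistics helps depends on how similar your robot and task are to the pre-training dataset. It is worth trying both options, i.e. loading the pre-trained statistics and recomputing them on your new dataset (see the main README for how to compute them), and keeping whichever performs better on your task.
1.2 Provided Pre-training Normalization Statistics
The table below lists all pre-trained normalization statistics provided by the project. They apply to the pi0_base and pi0_fast_base models.
For pi0_base, set assets_dir to s3://openpi-assets/checkpoints/pi0_base/assets
For pi0_fast_base, set it to s3://openpi-assets/checkpoints/pi0_fast_base/assets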
Robot | Description | Asset ID |
---|---|---|
ALOHA | 6-DoF dual arm robot with parallel grippers | trossen |
Mobile ALOHA | Mobile version of ALOHA mounted on a Slate base | trossen_mobile |
Franka Emika (DROID) | 7-DoF arm with parallel gripper based on the DROID setup | droid |
Franka Emika (non-DROID) | Franka FR3 arm with Robotiq 2F-85 gripper | franka |
UR5e | 6-DoF UR5e arm with Robotiq 2F-85 gripper | ur5e |
UR5e bi-manual | Bi-manual UR5e setup with Robotiq 2F-85 grippers | ur5e_dual |
ARX | Bi-manual ARX-5 robot arm setup with parallel gripper | arx |
ARX mobile | Mobile version of bi-manual ARX-5 robot arm setup mounted on a Slate base | arx_mobile |
Fibocom mobile | Fibocom mobile robot with 2x ARX-5 arms | fibocom_mobile |
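As a small sanity check before editing the config, the table can be turned into a lookup. This is a sketch only: `resolve_norm_stats_assets` is a hypothetical helper, not part of openpi. The joined location `assets_dir/asset_id` matches what the training log in section 2.2 reports when it downloads `.../pi0_base/assets/trossen`.

```python
ASSETS_DIRS = {
    "pi0_base": "s3://openpi-assets/checkpoints/pi0_base/assets",
    "pi0_fast_base": "s3://openpi-assets/checkpoints/pi0_fast_base/assets",
}

# Asset IDs copied from the table above.
ASSET_IDS = {
    "trossen", "trossen_mobile", "droid", "franka", "ur5e",
    "ur5e_dual", "arx", "arx_mobile", "fibocom_mobile",
}


def resolve_norm_stats_assets(model: str, asset_id: str) -> tuple[str, str]:
    """Return (assets_dir, asset_id) after checking both against the table."""
    if model not in ASSETS_DIRS:
        raise KeyError(f"unknown base model: {model!r}")
    if asset_id not in ASSET_IDS:
        raise KeyError(f"unknown asset ID: {asset_id!r}")
    return ASSETS_DIRS[model], asset_id


assets_dir, asset_id = resolve_norm_stats_assets("pi0_base", "trossen")
print(f"{assets_dir}/{asset_id}")
```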
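1.3 Pi0 Model Action Space Definitions
Out of the box, the pi0_base and pi0_fast_base models use the following action space definition (left and right are defined looking from behind the robot towards the workspace):
"dim_0:dim_5": left arm joint angles
"dim_6": left arm gripper position
"dim_7:dim_12": right arm joint angles (bi-manual robots only)
"dim_13": right arm gripper position (bi-manual robots only)
# For mobile robots:
"dim_14:dim_15": x-y base velocity (mobile robots only)
The proprioceptive state uses the same definition as the action space, except that for mobile robots it omits the last two dimensions: the base x-y position is not part of the state.
For 7-DoF robots (e.g. Franka), the first 7 dimensions of the action space hold the joint actions and the 8th dimension holds the gripper action.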
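The layout for 6-DoF arms can be sketched as a slicing helper. This is an illustration under assumptions: `split_action` is a hypothetical function, it handles only the 7-, 14- and 16-dimensional layouts listed above, and it ignores any padding the model may add as well as the 7-DoF (Franka-style) layout.

```python
def split_action(action):
    """Split a flat action vector into named parts using the layout above."""
    if len(action) not in (7, 14, 16):
        raise ValueError(f"expected 7, 14 or 16 dims, got {len(action)}")
    parts = {
        "left_arm_joints": action[0:6],  # dim_0:dim_5
        "left_gripper": action[6],       # dim_6
    }
    if len(action) >= 14:
        parts["right_arm_joints"] = action[7:13]  # dim_7:dim_12
        parts["right_gripper"] = action[13]       # dim_13
    if len(action) == 16:
        parts["base_xy_velocity"] = action[14:16]  # dim_14:dim_15
    return parts


mobile = split_action(list(range(16)))
bimanual = split_action(list(range(14)))
print(mobile)
```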
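General info for Pi robots
- Joint angles are in radians. Zero corresponds to the zero position reported by each robot's interface library, except for ALOHA, whose standard code uses a slightly different convention (see the ALOHA example code for details)
- Gripper positions are in the range [0.0, 1.0], where 0.0 is fully open and 1.0 is fully closed
- Control frequency: 20 Hz for UR5e and Franka, 50 Hz for ARX and Trossen (ALOHA)
- For DROID, the original DROID action configuration is used: the first 7 dimensions are joint velocity actions, the 8th dimension is the gripper action, and the control frequency is 15 Hz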
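If your own recordings store the gripper in raw units, they need to be mapped onto this [0.0, 1.0] convention. A minimal sketch, assuming a linear mapping between two calibration readings; `normalize_gripper` and the example readings are hypothetical and not taken from openpi.

```python
def normalize_gripper(raw: float, raw_open: float, raw_closed: float) -> float:
    """Map a raw gripper reading to [0.0, 1.0]: 0.0 is open, 1.0 is closed."""
    if raw_open == raw_closed:
        raise ValueError("raw_open and raw_closed must differ")
    value = (raw - raw_open) / (raw_closed - raw_open)
    return min(1.0, max(0.0, value))  # clamp readings slightly outside the range


# Example with an assumed raw range: 0.08 when fully open, 0.0 when closed.
print(normalize_gripper(0.04, raw_open=0.08, raw_closed=0.0))
```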
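1.4 Configuration example
Using the normalization statistics setup described above, configure your own ALOHA task; pi0_aloha_pen_uncap is a good reference.
The config file is located at openpi-main/src/openpi/training/config.py
My own config, for reference: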
TrainConfig(
    name="pi0_aloha_cable_sort",
    model=pi0.Pi0Config(),
    data=LeRobotAlohaDataConfig(
        repo_id="datasets/collection_coil",
        assets=AssetsConfig(
            assets_dir="s3://openpi-assets/checkpoints/pi0_base/assets",
            asset_id="trossen",
        ),
        default_prompt="sort the cable",
        repack_transforms=_transforms.Group(
            inputs=[
                _transforms.RepackTransform(
                    {
                        "images": {
                            "cam_high": "observation.images.cam_low",
                            "cam_left_wrist": "observation.images.cam_left_wrist",
                            "cam_right_wrist": "observation.images.cam_right_wrist",
                        },
                        "state": "observation.state",
                        "actions": "action",
                    }
                )
            ]
        ),
        base_config=DataConfig(
            local_files_only=False,  # Set to True for local-only datasets.
        ),
    ),
    weight_loader=weight_loaders.CheckpointWeightLoader("s3://openpi-assets/checkpoints/pi0_base/params"),
    num_train_steps=20_000,
),
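Notes on a few parameters:
- repo_id is the location and name of the local dataset; by default it resolves to /home/yejiangchen/.cache/huggingface/lerobot/datasets/collection_coil
- The mapping passed to _transforms.RepackTransform renames dataset keys: the values on the right must match the key names recorded during data collection, while the keys on the left are the names the policy's input transforms expect
- Default download location of the base model, as logged: Downloading s3://openpi-assets/checkpoints/pi0_base/params to /home/yejiangchen/.cache/openpi/openpi-assets/checkpoints/pi0_base/params (14672:download.py:93)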
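The repacking idea can be sketched in a few lines. This is a simplified stand-in, not openpi's implementation: it assumes the dict keys are the output names and the string values are the keys read from the dataset sample. The sample contents are placeholders.

```python
def repack(sample: dict, structure: dict) -> dict:
    """Build a dict shaped like `structure`, filling each leaf from `sample`."""
    out = {}
    for new_key, source in structure.items():
        if isinstance(source, dict):
            out[new_key] = repack(sample, source)
        else:
            out[new_key] = sample[source]  # KeyError if the dataset lacks this key
    return out


structure = {
    "images": {
        "cam_high": "observation.images.cam_low",
        "cam_left_wrist": "observation.images.cam_left_wrist",
        "cam_right_wrist": "observation.images.cam_right_wrist",
    },
    "state": "observation.state",
    "actions": "action",
}

# Placeholder sample using the key names stored in the recorded dataset.
sample = {
    "observation.images.cam_low": "low-frame",
    "observation.images.cam_left_wrist": "left-frame",
    "observation.images.cam_right_wrist": "right-frame",
    "observation.state": [0.0] * 14,
    "action": [0.0] * 14,
}

packed = repack(sample, structure)
print(sorted(packed["images"]))
```

A value that does not exist in the dataset raises a KeyError, which is the same kind of failure as the one in section 2.5.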
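2 Model fine-tuning
2.1 Command setup
Training can be started directly with the minimal-argument command:
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi0_aloha_cable_sort --exp-name=5_10 --overwrite
The remaining arguments will be covered later, when the script itself is analyzed.
This part mainly records the run logic and the fixes for the bugs encountered.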
2.2 TypeError: LeRobotDatasetMetadata.__init__() got an unexpected keyword argument 'local_files_only'
(pi0) yejiangchen@yejiangchen:~/Desktop/Codes/openpi-main$ XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi0_aloha_cable_sort --exp-name=5_10 --overwrite
/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/tyro/_parsers.py:332: UserWarning: The field model.action-expert-variant is annotated with type typing.Literal['dummy', 'gemma_300m', 'gemma_2b', 'gemma_2b_lora'], but the default value gemma_300m_lora has type <class 'str'>. We'll try to handle this gracefully, but it may cause unexpected behavior.
  warnings.warn(message)
10:08:23.888 [I] Running on: yejiangchen (7592:train.py:195)
INFO:2025-05-27 10:08:24,045:jax._src.xla_bridge:945: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
10:08:24.045 [I] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig' (7592:xla_bridge.py:945)
INFO:2025-05-27 10:08:24,046:jax._src.xla_bridge:945: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
10:08:24.046 [I] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory (7592:xla_bridge.py:945)
10:08:24.293 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (7592:base_pytree_checkpoint_handler.py:332)
10:08:24.294 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (7592:base_pytree_checkpoint_handler.py:332)
10:08:24.294 [I] [thread=MainThread] Failed to get flag value for EXPERIMENTAL_ORBAX_USE_DISTRIBUTED_PROCESS_ID. (7592:multihost.py:375)
10:08:24.294 [I] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'assets': <openpi.training.checkpoints.CallbackHandler object at 0x7823ab281b50>, 'train_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7823ab7f7a10>, 'params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7823ab2833d0>}, handler_registry=None (7592:checkpoint_manager.py:622)
10:08:24.294 [I] Deferred registration for item: "assets". Adding handler <openpi.training.checkpoints.CallbackHandler object at 0x7823ab281b50> for item "assets" and save args <class 'openpi.training.checkpoints.CallbackSave'> and restore args <class 'openpi.training.checkpoints.CallbackRestore'> to _handler_registry. (7592:composite_checkpoint_handler.py:239)
10:08:24.294 [I] Deferred registration for item: "train_state". Adding handler <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7823ab7f7a10> for item "train_state" and save args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'> to _handler_registry. (7592:composite_checkpoint_handler.py:239)
10:08:24.294 [I] Deferred registration for item: "params". Adding handler <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7823ab2833d0> for item "params" and save args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'> to _handler_registry. (7592:composite_checkpoint_handler.py:239)
10:08:24.294 [I] Deferred registration for item: "metrics". Adding handler <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7823ab500510> for item "metrics" and save args <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'> to _handler_registry. (7592:composite_checkpoint_handler.py:239)
10:08:24.294 [I] Initialized registry DefaultCheckpointHandlerRegistry({('assets', <class 'openpi.training.checkpoints.CallbackSave'>): <openpi.training.checkpoints.CallbackHandler object at 0x7823ab281b50>, ('assets', <class 'openpi.training.checkpoints.CallbackRestore'>): <openpi.training.checkpoints.CallbackHandler object at 0x7823ab281b50>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7823ab7f7a10>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7823ab7f7a10>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7823ab2833d0>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7823ab2833d0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7823ab500510>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7823ab500510>}). (7592:composite_checkpoint_handler.py:508)
10:08:24.294 [I] orbax-checkpoint version: 0.11.1 (7592:abstract_checkpointer.py:35)
10:08:24.294 [I] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x7823ab244cc0> timeout: 7200 secs and primary_host=0 for async checkpoint writes (7592:async_checkpointer.py:80)
10:08:24.294 [I] Found 0 checkpoint steps in /home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10 (7592:checkpoint_manager.py:1528)
10:08:24.294 [I] Saving root metadata (7592:checkpoint_manager.py:1569)
10:08:24.294 [I] [process=0][thread=MainThread] Skipping global process sync, barrier name: CheckpointManager:save_metadata (7592:multihost.py:293)
10:08:24.294 [I] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=1, keep_time_interval=None, keep_period=5000, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=False, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=AsyncOptions(timeout_secs=7200, barrier_sync_fn=None, post_finalization_callback=None, create_directories_asynchronously=False), multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None), root_directory=/home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7823ab638ed0> (7592:checkpoint_manager.py:797)
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 2
wandb: You chose 'Use an existing W&B account'
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /home/yejiangchen/.netrc
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in /home/yejiangchen/Desktop/Codes/openpi-main/wandb/run-20250527_100933-wcn78hrz
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run 5_10
wandb: ⭐️ View project at https://wandb.ai/yejiangchen-Nanjing%20University%20of%20Aeronautics%20and%20Astron/openpi
wandb: 🚀 View run at https://wandb.ai/yejiangchen-Nanjing%20University%20of%20Aeronautics%20and%20Astron/openpi/runs/wcn78hrz
10:09:34.221 [I] Downloading s3://openpi-assets/checkpoints/pi0_base/assets/trossen to /home/yejiangchen/.cache/openpi/openpi-assets/checkpoints/pi0_base/assets/trossen (7592:download.py:93)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.73k/2.73k [00:08<00:00, 327iB/s]
10:10:45.609 [I] Loaded norm stats from s3://openpi-assets/checkpoints/pi0_base/assets/trossen (7592:config.py:166)
Traceback (most recent call last):
  File "/home/yejiangchen/Desktop/Codes/openpi-main/scripts/train.py", line 273, in <module>
    main(_config.cli())
  File "/home/yejiangchen/Desktop/Codes/openpi-main/scripts/train.py", line 219, in main
    data_loader = _data_loader.create_data_loader(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 155, in create_data_loader
    dataset = create_dataset(data_config, config.model)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 92, in create_dataset
    dataset_meta = lerobot_dataset.LeRobotDatasetMetadata(repo_id, local_files_only=data_config.local_files_only)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: LeRobotDatasetMetadata.__init__() got an unexpected keyword argument 'local_files_only'
wandb: 🚀 View run 5_10 at: https://wandb.ai/yejiangchen-Nanjing University of Aeronautics and Astron/openpi/runs/wcn78hrz
wandb: Find logs at: wandb/run-20250527_100933-wcn78hrz/logs
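Cause: the __init__ method of lerobot_dataset.LeRobotDatasetMetadata does not accept a local_files_only argument, yet the code passes one. In other words, the installed lerobot version has an interface that is incompatible with the openpi project code.
Fix: the call is located in openpi-main/src/openpi/training/data_loader.py
Remove every use of the local_files_only argument in this script; there are two places: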
- dataset_meta = lerobot_dataset.LeRobotDatasetMetadata(repo_id, local_files_only=data_config.local_files_only)
+ dataset_meta = lerobot_dataset.LeRobotDatasetMetadata(repo_id)
- local_files_only=data_config.local_files_only,
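An alternative to deleting the argument by hand is to pass it only when the installed class accepts it. The sketch below shows that pattern with `inspect.signature`. The two classes are hypothetical stand-ins for library versions with and without the flag, not real lerobot classes, and `call_with_supported_kwargs` is an illustrative helper.

```python
import inspect


def call_with_supported_kwargs(factory, *args, **kwargs):
    """Call `factory`, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(factory).parameters
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_any:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return factory(*args, **kwargs)


# Stand-ins for two library versions (hypothetical, for illustration only).
class MetadataWithFlag:
    def __init__(self, repo_id, local_files_only=False):
        self.repo_id = repo_id
        self.local_files_only = local_files_only


class MetadataWithoutFlag:
    def __init__(self, repo_id):
        self.repo_id = repo_id


with_flag = call_with_supported_kwargs(
    MetadataWithFlag, "datasets/collection_coil", local_files_only=True
)
without_flag = call_with_supported_kwargs(
    MetadataWithoutFlag, "datasets/collection_coil", local_files_only=True
)
print(with_flag.local_files_only, without_flag.repo_id)
```

Silently dropping an argument hides the version mismatch, so pinning the lerobot version that openpi expects is the cleaner long-term fix.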
2.3 FileNotFoundError: [Errno 2] No such file or directory: '/home/yejiangchen/.cache/huggingface/lerobot/datasets/collection_coil/meta/info.json'
(pi0) yejiangchen@yejiangchen:~/Desktop/Codes/openpi-main$ XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi0_aloha_cable_sort --exp-name=5_10 --overwrite
/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/tyro/_parsers.py:332: UserWarning: The field model.action-expert-variant is annotated with type typing.Literal['dummy', 'gemma_300m', 'gemma_2b', 'gemma_2b_lora'], but the default value gemma_300m_lora has type <class 'str'>. We'll try to handle this gracefully, but it may cause unexpected behavior.
  warnings.warn(message)
10:28:59.029 [I] Running on: yejiangchen (9318:train.py:195)
INFO:2025-05-27 10:28:59,174:jax._src.xla_bridge:945: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
10:28:59.174 [I] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig' (9318:xla_bridge.py:945)
INFO:2025-05-27 10:28:59,175:jax._src.xla_bridge:945: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
10:28:59.175 [I] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory (9318:xla_bridge.py:945)
10:28:59.389 [I] Wiped checkpoint directory /home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10 (9318:checkpoints.py:25)
10:28:59.389 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (9318:base_pytree_checkpoint_handler.py:332)
10:28:59.389 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (9318:base_pytree_checkpoint_handler.py:332)
10:28:59.389 [I] [thread=MainThread] Failed to get flag value for EXPERIMENTAL_ORBAX_USE_DISTRIBUTED_PROCESS_ID. (9318:multihost.py:375)
10:28:59.389 [I] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'assets': <openpi.training.checkpoints.CallbackHandler object at 0x7cbe75e95a50>, 'train_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7cbe75ecab10>, 'params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7cbe75ecae90>}, handler_registry=None (9318:checkpoint_manager.py:622)
10:28:59.389 [I] Deferred registration for item: "assets". Adding handler <openpi.training.checkpoints.CallbackHandler object at 0x7cbe75e95a50> for item "assets" and save args <class 'openpi.training.checkpoints.CallbackSave'> and restore args <class 'openpi.training.checkpoints.CallbackRestore'> to _handler_registry. (9318:composite_checkpoint_handler.py:239)
10:28:59.389 [I] Deferred registration for item: "train_state". Adding handler <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7cbe75ecab10> for item "train_state" and save args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'> to _handler_registry. (9318:composite_checkpoint_handler.py:239)
10:28:59.389 [I] Deferred registration for item: "params". Adding handler <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7cbe75ecae90> for item "params" and save args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'> to _handler_registry. (9318:composite_checkpoint_handler.py:239)
10:28:59.389 [I] Deferred registration for item: "metrics". Adding handler <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7cbfaa599a90> for item "metrics" and save args <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'> to _handler_registry. (9318:composite_checkpoint_handler.py:239)
10:28:59.389 [I] Initialized registry DefaultCheckpointHandlerRegistry({('assets', <class 'openpi.training.checkpoints.CallbackSave'>): <openpi.training.checkpoints.CallbackHandler object at 0x7cbe75e95a50>, ('assets', <class 'openpi.training.checkpoints.CallbackRestore'>): <openpi.training.checkpoints.CallbackHandler object at 0x7cbe75e95a50>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7cbe75ecab10>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7cbe75ecab10>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7cbe75ecae90>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7cbe75ecae90>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7cbfaa599a90>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7cbfaa599a90>}). (9318:composite_checkpoint_handler.py:508)
10:28:59.390 [I] orbax-checkpoint version: 0.11.1 (9318:abstract_checkpointer.py:35)
10:28:59.390 [I] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x7cbe75d3e2a0> timeout: 7200 secs and primary_host=0 for async checkpoint writes (9318:async_checkpointer.py:80)
10:28:59.390 [I] Found 0 checkpoint steps in /home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10 (9318:checkpoint_manager.py:1528)
10:28:59.390 [I] Saving root metadata (9318:checkpoint_manager.py:1569)
10:28:59.390 [I] [process=0][thread=MainThread] Skipping global process sync, barrier name: CheckpointManager:save_metadata (9318:multihost.py:293)
10:28:59.390 [I] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=1, keep_time_interval=None, keep_period=5000, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=False, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=AsyncOptions(timeout_secs=7200, barrier_sync_fn=None, post_finalization_callback=None, create_directories_asynchronously=False), multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None), root_directory=/home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7cbe75fb4510> (9318:checkpoint_manager.py:797)
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: yejiangchen (yejiangchen-Nanjing University of Aeronautics and Astron). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in /home/yejiangchen/Desktop/Codes/openpi-main/wandb/run-20250527_102901-7f7a36a5
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run 5_10
wandb: ⭐️ View project at https://wandb.ai/yejiangchen-Nanjing%20University%20of%20Aeronautics%20and%20Astron/openpi
wandb: 🚀 View run at https://wandb.ai/yejiangchen-Nanjing%20University%20of%20Aeronautics%20and%20Astron/openpi/runs/7f7a36a5
10:29:02.101 [I] Loaded norm stats from s3://openpi-assets/checkpoints/pi0_base/assets/trossen (9318:config.py:166)
Traceback (most recent call last):
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/lerobot/common/datasets/lerobot_dataset.py", line 95, in __init__
    self.load_metadata()
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/lerobot/common/datasets/lerobot_dataset.py", line 105, in load_metadata
    self.info = load_info(self.root)
                ^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/lerobot/common/datasets/utils.py", line 178, in load_info
    info = load_json(local_dir / INFO_PATH)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/lerobot/common/datasets/utils.py", line 146, in load_json
    with open(fpath) as f:
         ^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/yejiangchen/.cache/huggingface/lerobot/datasets/collection_coil/meta/info.json'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/datasets/datasets/collection_coil/refs

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/yejiangchen/Desktop/Codes/openpi-main/scripts/train.py", line 273, in <module>
    main(_config.cli())
  File "/home/yejiangchen/Desktop/Codes/openpi-main/scripts/train.py", line 219, in main
    data_loader = _data_loader.create_data_loader(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 155, in create_data_loader
    dataset = create_dataset(data_config, config.model)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 92, in create_dataset
    dataset_meta = lerobot_dataset.LeRobotDatasetMetadata(repo_id)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/lerobot/common/datasets/lerobot_dataset.py", line 98, in __init__
    self.revision = get_safe_version(self.repo_id, self.revision)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/lerobot/common/datasets/utils.py", line 327, in get_safe_version
    hub_versions = get_repo_versions(repo_id)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/lerobot/common/datasets/utils.py", line 309, in get_repo_versions
    repo_refs = api.list_repo_refs(repo_id, repo_type="dataset")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 3088, in list_repo_refs
    hf_raise_for_status(response)
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 454, in hf_raise_for_status
    raise _format(RepositoryNotFoundError, message, response) from e
huggingface_hub.errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-6835236e-7372dce941fafd8719974a45;191fd555-fc62-4b15-8704-825d1bbf5e5d)

Repository Not Found for url: https://huggingface.co/api/datasets/datasets/collection_coil/refs.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
wandb: 🚀 View run 5_10 at: https://wandb.ai/yejiangchen-Nanjing University of Aeronautics and Astron/openpi/runs/7f7a36a5
wandb: Find logs at: wandb/run-20250527_102901-7f7a36a5/logs
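The local fine-tuning dataset was not found, so lerobot fell back to the Hugging Face Hub and failed there as well. The relevant setting in config.py is:
repo_id="datasets/collection_coil"
Note that lerobot's default local cache path is ~/.cache/huggingface/lerobot/{repo_id}/..., which for this repo_id is ~/.cache/huggingface/lerobot/datasets/collection_coil/...
Place the local dataset, in lerobot format, at that location.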
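Judging from the FileNotFoundError path above, the loader looks for meta/info.json under ~/.cache/huggingface/lerobot/<repo_id>. A quick pre-flight check along those lines can save a failed run. This is a sketch: both helper names are hypothetical, and the cache root is inferred from the traceback rather than from lerobot's documentation.

```python
from pathlib import Path


def expected_info_path(repo_id: str, cache_root: Path) -> Path:
    """Where the metadata file is expected for a locally cached dataset."""
    return cache_root / repo_id / "meta" / "info.json"


def dataset_is_cached(repo_id: str, cache_root: Path) -> bool:
    """True if the dataset's meta/info.json exists under the cache root."""
    return expected_info_path(repo_id, cache_root).is_file()


# Cache root inferred from the error message above (an assumption).
cache_root = Path.home() / ".cache" / "huggingface" / "lerobot"
repo_id = "datasets/collection_coil"
print(expected_info_path(repo_id, cache_root))
print("found" if dataset_is_cached(repo_id, cache_root) else "missing")
```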
2.4 The video decoding library torchcodec fails to load its FFmpeg dependencies
(pi0) yejiangchen@yejiangchen:~/Desktop/Codes/openpi-main$ XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi0_aloha_cable_sort --exp-name=5_10 --overwrite
/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/tyro/_parsers.py:332: UserWarning: The field model.action-expert-variant is annotated with type typing.Literal['dummy', 'gemma_300m', 'gemma_2b', 'gemma_2b_lora'], but the default value gemma_300m_lora has type <class 'str'>. We'll try to handle this gracefully, but it may cause unexpected behavior.
  warnings.warn(message)
10:36:11.021 [I] Running on: yejiangchen (10245:train.py:195)
INFO:2025-05-27 10:36:11,167:jax._src.xla_bridge:945: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
10:36:11.167 [I] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig' (10245:xla_bridge.py:945)
INFO:2025-05-27 10:36:11,167:jax._src.xla_bridge:945: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
10:36:11.167 [I] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory (10245:xla_bridge.py:945)
10:36:11.388 [I] Wiped checkpoint directory /home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10 (10245:checkpoints.py:25)
10:36:11.388 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (10245:base_pytree_checkpoint_handler.py:332)
10:36:11.388 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (10245:base_pytree_checkpoint_handler.py:332)
10:36:11.388 [I] [thread=MainThread] Failed to get flag value for EXPERIMENTAL_ORBAX_USE_DISTRIBUTED_PROCESS_ID. (10245:multihost.py:375)
10:36:11.388 [I] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'assets': <openpi.training.checkpoints.CallbackHandler object at 0x702cf06aa090>, 'train_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x702cf0758990>, 'params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x702cf063a910>}, handler_registry=None (10245:checkpoint_manager.py:622)
10:36:11.388 [I] Deferred registration for item: "assets". Adding handler <openpi.training.checkpoints.CallbackHandler object at 0x702cf06aa090> for item "assets" and save args <class 'openpi.training.checkpoints.CallbackSave'> and restore args <class 'openpi.training.checkpoints.CallbackRestore'> to _handler_registry. (10245:composite_checkpoint_handler.py:239)
10:36:11.388 [I] Deferred registration for item: "train_state". Adding handler <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x702cf0758990> for item "train_state" and save args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'> to _handler_registry. (10245:composite_checkpoint_handler.py:239)
10:36:11.388 [I] Deferred registration for item: "params". Adding handler <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x702cf063a910> for item "params" and save args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'> to _handler_registry. (10245:composite_checkpoint_handler.py:239)
10:36:11.388 [I] Deferred registration for item: "metrics". Adding handler <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x702cf06ab010> for item "metrics" and save args <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'> to _handler_registry. (10245:composite_checkpoint_handler.py:239)
10:36:11.388 [I] Initialized registry DefaultCheckpointHandlerRegistry({('assets', <class 'openpi.training.checkpoints.CallbackSave'>): <openpi.training.checkpoints.CallbackHandler object at 0x702cf06aa090>, ('assets', <class 'openpi.training.checkpoints.CallbackRestore'>): <openpi.training.checkpoints.CallbackHandler object at 0x702cf06aa090>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x702cf0758990>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x702cf0758990>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x702cf063a910>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x702cf063a910>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x702cf06ab010>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x702cf06ab010>}). (10245:composite_checkpoint_handler.py:508)
10:36:11.389 [I] orbax-checkpoint version: 0.11.1 (10245:abstract_checkpointer.py:35)
10:36:11.389 [I] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x702cf056a2a0> timeout: 7200 secs and primary_host=0 for async checkpoint writes (10245:async_checkpointer.py:80)
10:36:11.389 [I] Found 0 checkpoint steps in /home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10 (10245:checkpoint_manager.py:1528)
10:36:11.389 [I] Saving root metadata (10245:checkpoint_manager.py:1569)
10:36:11.389 [I] [process=0][thread=MainThread] Skipping global process sync, barrier name: CheckpointManager:save_metadata (10245:multihost.py:293)
10:36:11.389 [I] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=1, keep_time_interval=None, keep_period=5000, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=False, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=AsyncOptions(timeout_secs=7200, barrier_sync_fn=None, post_finalization_callback=None, create_directories_asynchronously=False), multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None), root_directory=/home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x702cf09bd110> (10245:checkpoint_manager.py:797)
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: yejiangchen (yejiangchen-Nanjing University of Aeronautics and Astron). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in /home/yejiangchen/Desktop/Codes/openpi-main/wandb/run-20250527_103612-2p6k0eoa
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run 5_10
wandb: ⭐️ View project at https://wandb.ai/yejiangchen-Nanjing%20University%20of%20Aeronautics%20and%20Astron/openpi
wandb: 🚀 View run at https://wandb.ai/yejiangchen-Nanjing%20University%20of%20Aeronautics%20and%20Astron/openpi/runs/2p6k0eoa
10:36:13.534 [I] Loaded norm stats from s3://openpi-assets/checkpoints/pi0_base/assets/trossen (10245:config.py:166)
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:00<00:00, 580567.50it/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:00<00:00, 682793.67files/s]
Generating train split: 36003 examples [00:00, 1146228.09 examples/s]
Traceback (most recent call last):
  File "/home/yejiangchen/Desktop/Codes/openpi-main/scripts/train.py", line 273, in <module>
    main(_config.cli())
  File "/home/yejiangchen/Desktop/Codes/openpi-main/scripts/train.py", line 226, in main
    batch = next(data_iter)
            ^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 176, in __iter__
    for batch in self._data_loader:
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 256, in __iter__
    batch = next(data_iter)
            ^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 708, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1480, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data
    data.reraise()
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/_utils.py", line 733, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 47, in __getitem__
    return self._transform(self._dataset[index])
                           ~~~~~~~~~~~~~^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/lerobot/common/datasets/lerobot_dataset.py", line 739, in __getitem__
    video_frames = self._query_videos(query_timestamps, ep_idx)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/lerobot/common/datasets/lerobot_dataset.py", line 711, in _query_videos
    frames = decode_video_frames(video_path, query_ts, self.tolerance_s, self.video_backend)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/lerobot/common/datasets/video_utils.py", line 65, in decode_video_frames
    return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/lerobot/common/datasets/video_utils.py", line 189, in decode_video_frames_torchcodec
    from torchcodec.decoders import VideoDecoder
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torchcodec/__init__.py", line 10, in <module>
    from . import decoders, samplers  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torchcodec/decoders/__init__.py", line 7, in <module>
    from ._core import VideoStreamMetadata
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torchcodec/decoders/_core/__init__.py", line 8, in <module>
    from ._metadata import (
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torchcodec/decoders/_core/_metadata.py", line 15, in <module>
    from torchcodec.decoders._core.video_decoder_ops import (
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torchcodec/decoders/_core/video_decoder_ops.py", line 59, in <module>
    load_torchcodec_extension()
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torchcodec/decoders/_core/video_decoder_ops.py", line 44, in load_torchcodec_extension
    raise RuntimeError(
RuntimeError: Could not load libtorchcodec. Likely causes:
  1. FFmpeg is not properly installed in your environment. We support
     versions 4, 5, 6 and 7.
  2. The PyTorch version (2.6.0+cu124) is not compatible with
     this version of TorchCodec. Refer to the version compatibility
     table:
     https://github.com/pytorch/torchcodec?tab=readme-ov-file#installing-torchcodec.
  3. Another runtime dependency; see exceptions below.
The following exceptions were raised as we tried to load libtorchcodec:
[start of libtorchcodec loading traceback]
libavutil.so.59: cannot open shared object file: No such file or directory
libavutil.so.58: cannot open shared object file: No such file or directory
libavutil.so.57: cannot open shared object file: No such file or directory
libavutil.so.56: cannot open shared object file: No such file or directory
[end of libtorchcodec loading traceback].
wandb: 🚀 View run 5_10 at: https://wandb.ai/yejiangchen-Nanjing University of Aeronautics and Astron/openpi/runs/2p6k0eoa
wandb: Find logs at: wandb/run-20250527_103612-2p6k0eoa/logs
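torchcodec (efficient video frame reading and decoding) depends on the FFmpeg shared libraries (libavutil.so and others). The loader tried libavutil.so.56 through libavutil.so.59 and found none of them, so FFmpeg is not installed on this machine.
Install FFmpeg and its development dependencies: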
sudo apt-get update
sudo apt-get install ffmpeg libavutil-dev libavcodec-dev libavformat-dev libswscale-dev
After installation, restart the conda/venv/uv environment so that the new shared libraries are visible, then rerun the training command.
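Before rerunning training, it can help to confirm that the dynamic linker actually resolves the FFmpeg libraries. The following is a minimal sketch, not part of openpi or torchcodec: the helper names find_ffmpeg_libs and missing_ffmpeg_libs are hypothetical, and the check relies only on the standard-library ctypes.util.find_library. It reports whether a library can be located. It does not verify that the installed FFmpeg version is one that torchcodec supports.

```python
import ctypes.util


def find_ffmpeg_libs(names=("avutil", "avcodec", "avformat", "swscale")):
    """Map each library name to what the dynamic linker resolves, or None."""
    return {name: ctypes.util.find_library(name) for name in names}


def missing_ffmpeg_libs(found):
    """Return the sorted names of libraries that could not be resolved."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_ffmpeg_libs()
    for name, path in found.items():
        print(f"lib{name}: {path or 'NOT FOUND'}")
    missing = missing_ffmpeg_libs(found)
    if missing:
        print("Missing FFmpeg libraries:", ", ".join(missing))
```

If any library is reported as missing after the apt-get install, the shell or virtual environment was probably started before the installation and should be reopened.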
2.5 KeyError: 'observation.images.cam_high'
(pi0) yejiangchen@yejiangchen:~/Desktop/Codes/openpi-main$ XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi0_aloha_cable_sort --exp-name=5_10 --overwrite
/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/tyro/_parsers.py:332: UserWarning: The field model.action-expert-variant is annotated with type typing.Literal['dummy', 'gemma_300m', 'gemma_2b', 'gemma_2b_lora'], but the default value gemma_300m_lora has type <class 'str'>. We'll try to handle this gracefully, but it may cause unexpected behavior.
  warnings.warn(message)
10:39:48.897 [I] Running on: yejiangchen (12059:train.py:195)
INFO:2025-05-27 10:39:49,049:jax._src.xla_bridge:945: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
10:39:49.049 [I] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig' (12059:xla_bridge.py:945)
INFO:2025-05-27 10:39:49,049:jax._src.xla_bridge:945: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
10:39:49.049 [I] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory (12059:xla_bridge.py:945)
10:39:49.263 [I] Wiped checkpoint directory /home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10 (12059:checkpoints.py:25)
10:39:49.263 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (12059:base_pytree_checkpoint_handler.py:332)
10:39:49.263 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (12059:base_pytree_checkpoint_handler.py:332)
10:39:49.263 [I] [thread=MainThread] Failed to get flag value for EXPERIMENTAL_ORBAX_USE_DISTRIBUTED_PROCESS_ID. (12059:multihost.py:375)
10:39:49.263 [I] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'assets': <openpi.training.checkpoints.CallbackHandler object at 0x7ab6c78cfb50>, 'train_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ab6c784c450>, 'params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ab6d2237610>}, handler_registry=None (12059:checkpoint_manager.py:622)
10:39:49.263 [I] Deferred registration for item: "assets". Adding handler <openpi.training.checkpoints.CallbackHandler object at 0x7ab6c78cfb50> for item "assets" and save args <class 'openpi.training.checkpoints.CallbackSave'> and restore args <class 'openpi.training.checkpoints.CallbackRestore'> to _handler_registry. (12059:composite_checkpoint_handler.py:239)
10:39:49.263 [I] Deferred registration for item: "train_state". Adding handler <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ab6c784c450> for item "train_state" and save args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'> to _handler_registry. (12059:composite_checkpoint_handler.py:239)
10:39:49.263 [I] Deferred registration for item: "params". Adding handler <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ab6d2237610> for item "params" and save args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'> to _handler_registry. (12059:composite_checkpoint_handler.py:239)
10:39:49.263 [I] Deferred registration for item: "metrics". Adding handler <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7ab6c7649410> for item "metrics" and save args <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'> to _handler_registry. (12059:composite_checkpoint_handler.py:239)
10:39:49.263 [I] Initialized registry DefaultCheckpointHandlerRegistry({('assets', <class 'openpi.training.checkpoints.CallbackSave'>): <openpi.training.checkpoints.CallbackHandler object at 0x7ab6c78cfb50>, ('assets', <class 'openpi.training.checkpoints.CallbackRestore'>): <openpi.training.checkpoints.CallbackHandler object at 0x7ab6c78cfb50>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ab6c784c450>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ab6c784c450>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ab6d2237610>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ab6d2237610>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7ab6c7649410>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7ab6c7649410>}). (12059:composite_checkpoint_handler.py:508)
10:39:49.264 [I] orbax-checkpoint version: 0.11.1 (12059:abstract_checkpointer.py:35)
10:39:49.264 [I] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x7ab6c755a200> timeout: 7200 secs and primary_host=0 for async checkpoint writes (12059:async_checkpointer.py:80)
10:39:49.264 [I] Found 0 checkpoint steps in /home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10 (12059:checkpoint_manager.py:1528)
10:39:49.264 [I] Saving root metadata (12059:checkpoint_manager.py:1569)
10:39:49.264 [I] [process=0][thread=MainThread] Skipping global process sync, barrier name: CheckpointManager:save_metadata (12059:multihost.py:293)
10:39:49.264 [I] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=1, keep_time_interval=None, keep_period=5000, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=False, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=AsyncOptions(timeout_secs=7200, barrier_sync_fn=None, post_finalization_callback=None, create_directories_asynchronously=False), multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None), root_directory=/home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7ab6c7817790> (12059:checkpoint_manager.py:797)
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: yejiangchen (yejiangchen-Nanjing University of Aeronautics and Astron). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in /home/yejiangchen/Desktop/Codes/openpi-main/wandb/run-20250527_103950-7638hqw4
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run 5_10
wandb: ⭐️ View project at https://wandb.ai/yejiangchen-Nanjing%20University%20of%20Aeronautics%20and%20Astron/openpi
wandb: 🚀 View run at https://wandb.ai/yejiangchen-Nanjing%20University%20of%20Aeronautics%20and%20Astron/openpi/runs/7638hqw4
10:39:51.530 [I] Loaded norm stats from s3://openpi-assets/checkpoints/pi0_base/assets/trossen (12059:config.py:166)
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:00<00:00, 306290.46it/s]
Traceback (most recent call last):
  File "/home/yejiangchen/Desktop/Codes/openpi-main/scripts/train.py", line 273, in <module>
    main(_config.cli())
  File "/home/yejiangchen/Desktop/Codes/openpi-main/scripts/train.py", line 226, in main
    batch = next(data_iter)
            ^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 176, in __iter__
    for batch in self._data_loader:
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 256, in __iter__
    batch = next(data_iter)
            ^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 708, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1480, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data
    data.reraise()
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/_utils.py", line 733, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 47, in __getitem__
    return self._transform(self._dataset[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/transforms.py", line 70, in __call__
    data = transform(data)
           ^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/transforms.py", line 101, in __call__
    return jax.tree.map(lambda k: flat_item[k], self.structure)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/jax/_src/tree.py", line 155, in map
    return tree_util.tree_map(f, tree, *rest, is_leaf=is_leaf)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/jax/_src/tree_util.py", line 358, in tree_map
    return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/jax/_src/tree_util.py", line 358, in <genexpr>
    return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))
                             ^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/transforms.py", line 101, in <lambda>
    return jax.tree.map(lambda k: flat_item[k], self.structure)
                                  ~~~~~~~~~^^^
KeyError: 'observation.images.cam_high'
wandb: 🚀 View run 5_10 at: https://wandb.ai/yejiangchen-Nanjing University of Aeronautics and Astron/openpi/runs/7638hqw4
wandb: Find logs at: wandb/run-20250527_103950-7638hqw4/logs
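The transform at line 101 of openpi/transforms.py tries to read flat_item['observation.images.cam_high'] from each sample, but the samples in the local dataset have no observation.images.cam_high key.
The fix is to edit config.py so that the key names used during data collection match those in the TrainConfig.
I collected the data with the lerobot source code, so the camera names and robot state names in the config must match the recorded dataset exactly.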
TrainConfig(
    name="pi0_aloha_cable_sort",
    model=pi0.Pi0Config(),
    data=LeRobotAlohaDataConfig(
        repo_id="datasets/collection_coil",
        assets=AssetsConfig(
            assets_dir="s3://openpi-assets/checkpoints/pi0_base/assets",
            asset_id="trossen",
        ),
        default_prompt="sort the cable",
        repack_transforms=_transforms.Group(
            inputs=[
                _transforms.RepackTransform(
                    {
                        "images": {
                            "cam_high": "observation.images.cam_low",
                            "cam_left_wrist": "observation.images.cam_left_wrist",
                            "cam_right_wrist": "observation.images.cam_right_wrist",
                        },
                        "state": "observation.state",
                        "actions": "action",
                    }
                )
            ]
        ),
        base_config=DataConfig(
            local_files_only=False,  # Set to True for local-only datasets.
        ),
    ),
    weight_loader=weight_loaders.CheckpointWeightLoader("s3://openpi-assets/checkpoints/pi0_base/params"),
    num_train_steps=20_000,
),
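A mismatch like this can be caught before launching training by comparing the source keys in the repack mapping against the keys a dataset sample actually contains. The sketch below is a standalone illustration in plain Python and does not call openpi or lerobot: the helpers source_keys and missing_keys are hypothetical names, and sample_keys is an assumed list that you would replace with the feature names of your own dataset (for example, those recorded in its meta/info.json).

```python
def source_keys(structure):
    """Collect every leaf value (a dataset key) from a nested repack mapping."""
    keys = []
    for value in structure.values():
        if isinstance(value, dict):
            keys.extend(source_keys(value))
        else:
            keys.append(value)
    return keys


def missing_keys(structure, sample_keys):
    """Return the mapped dataset keys that are absent from the sample."""
    available = set(sample_keys)
    return [key for key in source_keys(structure) if key not in available]


# Same mapping as in the TrainConfig above.
repack = {
    "images": {
        "cam_high": "observation.images.cam_low",
        "cam_left_wrist": "observation.images.cam_left_wrist",
        "cam_right_wrist": "observation.images.cam_right_wrist",
    },
    "state": "observation.state",
    "actions": "action",
}

# Assumed sample keys for illustration only; use your dataset's real keys.
sample_keys = [
    "observation.images.cam_low",
    "observation.images.cam_left_wrist",
    "observation.images.cam_right_wrist",
    "observation.state",
    "action",
]

print(missing_keys(repack, sample_keys))
```

With these assumed keys the list is empty. Mapping cam_high to observation.images.cam_high instead would report that key as missing, which corresponds to the KeyError in the log.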
2.6 ValueError: operands could not be broadcast together with shapes (14,) (18,)
state = _joint_flip_mask() * state
(pi0) yejiangchen@yejiangchen:~/Desktop/Codes/openpi-main$ XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi0_aloha_cable_sort --exp-name=5_10 --overwrite
/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/tyro/_parsers.py:332: UserWarning: The field model.action-expert-variant is annotated with type typing.Literal['dummy', 'gemma_300m', 'gemma_2b', 'gemma_2b_lora'], but the default value gemma_300m_lora has type <class 'str'>. We'll try to handle this gracefully, but it may cause unexpected behavior.
  warnings.warn(message)
10:42:24.339 [I] Running on: yejiangchen (12602:train.py:195)
INFO:2025-05-27 10:42:24,489:jax._src.xla_bridge:945: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
10:42:24.489 [I] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig' (12602:xla_bridge.py:945)
INFO:2025-05-27 10:42:24,490:jax._src.xla_bridge:945: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
10:42:24.490 [I] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory (12602:xla_bridge.py:945)
10:42:24.710 [I] Wiped checkpoint directory /home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10 (12602:checkpoints.py:25)
10:42:24.710 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (12602:base_pytree_checkpoint_handler.py:332)
10:42:24.710 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (12602:base_pytree_checkpoint_handler.py:332)
10:42:24.710 [I] [thread=MainThread] Failed to get flag value for EXPERIMENTAL_ORBAX_USE_DISTRIBUTED_PROCESS_ID. (12602:multihost.py:375)
10:42:24.710 [I] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'assets': <openpi.training.checkpoints.CallbackHandler object at 0x788465ebba10>, 'train_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x788465c60290>, 'params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x788465bb5d50>}, handler_registry=None (12602:checkpoint_manager.py:622)
10:42:24.710 [I] Deferred registration for item: "assets". Adding handler <openpi.training.checkpoints.CallbackHandler object at 0x788465ebba10> for item "assets" and save args <class 'openpi.training.checkpoints.CallbackSave'> and restore args <class 'openpi.training.checkpoints.CallbackRestore'> to _handler_registry. (12602:composite_checkpoint_handler.py:239)
10:42:24.710 [I] Deferred registration for item: "train_state". Adding handler <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x788465c60290> for item "train_state" and save args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'> to _handler_registry. (12602:composite_checkpoint_handler.py:239)
10:42:24.710 [I] Deferred registration for item: "params". Adding handler <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x788465bb5d50> for item "params" and save args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'> to _handler_registry. (12602:composite_checkpoint_handler.py:239)
10:42:24.710 [I] Deferred registration for item: "metrics". Adding handler <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x788465d50610> for item "metrics" and save args <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'> and restore args <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'> to _handler_registry. (12602:composite_checkpoint_handler.py:239)
10:42:24.710 [I] Initialized registry DefaultCheckpointHandlerRegistry({('assets', <class 'openpi.training.checkpoints.CallbackSave'>): <openpi.training.checkpoints.CallbackHandler object at 0x788465ebba10>, ('assets', <class 'openpi.training.checkpoints.CallbackRestore'>): <openpi.training.checkpoints.CallbackHandler object at 0x788465ebba10>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x788465c60290>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x788465c60290>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x788465bb5d50>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x788465bb5d50>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x788465d50610>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x788465d50610>}). (12602:composite_checkpoint_handler.py:508)
10:42:24.710 [I] orbax-checkpoint version: 0.11.1 (12602:abstract_checkpointer.py:35)
10:42:24.710 [I] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x788465a6a200> timeout: 7200 secs and primary_host=0 for async checkpoint writes (12602:async_checkpointer.py:80)
10:42:24.710 [I] Found 0 checkpoint steps in /home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10 (12602:checkpoint_manager.py:1528)
10:42:24.710 [I] Saving root metadata (12602:checkpoint_manager.py:1569)
10:42:24.711 [I] [process=0][thread=MainThread] Skipping global process sync, barrier name: CheckpointManager:save_metadata (12602:multihost.py:293)
10:42:24.711 [I] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=1, keep_time_interval=None, keep_period=5000, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=False, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=AsyncOptions(timeout_secs=7200, barrier_sync_fn=None, post_finalization_callback=None, create_directories_asynchronously=False), multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None), root_directory=/home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x78858ab859d0> (12602:checkpoint_manager.py:797)
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: yejiangchen (yejiangchen-Nanjing University of Aeronautics and Astron). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in /home/yejiangchen/Desktop/Codes/openpi-main/wandb/run-20250527_104225-aypbxjp6
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run 5_10
wandb: ⭐️ View project at https://wandb.ai/yejiangchen-Nanjing%20University%20of%20Aeronautics%20and%20Astron/openpi
wandb: 🚀 View run at https://wandb.ai/yejiangchen-Nanjing%20University%20of%20Aeronautics%20and%20Astron/openpi/runs/aypbxjp6
10:42:26.875 [I] Loaded norm stats from s3://openpi-assets/checkpoints/pi0_base/assets/trossen (12602:config.py:166)
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:00<00:00, 249721.62it/s]
Traceback (most recent call last):
  File "/home/yejiangchen/Desktop/Codes/openpi-main/scripts/train.py", line 273, in <module>
    main(_config.cli())
  File "/home/yejiangchen/Desktop/Codes/openpi-main/scripts/train.py", line 226, in main
    batch = next(data_iter)
            ^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 176, in __iter__
    for batch in self._data_loader:
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 256, in __iter__
    batch = next(data_iter)
            ^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 708, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1480, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data
    data.reraise()
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/_utils.py", line 733, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index) # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/training/data_loader.py", line 47, in __getitem__
    return self._transform(self._dataset[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/transforms.py", line 70, in __call__
    data = transform(data)
           ^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/policies/aloha_policy.py", line 46, in __call__
    data = _decode_aloha(data, adapt_to_pi=self.adapt_to_pi)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/policies/aloha_policy.py", line 167, in _decode_aloha
    state = _decode_state(state, adapt_to_pi=adapt_to_pi)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yejiangchen/Desktop/Codes/openpi-main/src/openpi/policies/aloha_policy.py", line 188, in _decode_state
    state = _joint_flip_mask() * state
            ~~~~~~~~~~~~~~~~~~~^~~~~~~
ValueError: operands could not be broadcast together with shapes (14,) (18,)
wandb: 🚀 View run 5_10 at: https://wandb.ai/yejiangchen-Nanjing University of Aeronautics and Astron/openpi/runs/aypbxjp6
wandb: Find logs at: wandb/run-20250527_104225-aypbxjp6/logs
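A quick analysis of the problem:
- The failing expression is _joint_flip_mask() * state, and NumPy lists the operand shapes in that order
- _joint_flip_mask() therefore returns an array of shape (14,)
- state, the state vector read from the data sample, has shape (18,)
- The two shapes differ and neither is 1, so they cannot be broadcast for element-wise multiplication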
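The mismatch can be reproduced outside the training pipeline. The following is a minimal sketch that uses placeholder arrays (not the real mask or dataset values) to show that NumPy reports the shapes in operand order, so for _joint_flip_mask() * state the first shape belongs to the mask and the second to the state:

```python
import numpy as np

# Placeholder operands: only the lengths matter for this illustration.
mask = np.ones(14)    # stands in for the 14-element _joint_flip_mask()
state = np.zeros(18)  # stands in for an 18-element state sample

try:
    mask * state
    message = None
except ValueError as exc:
    # NumPy lists the shapes left to right, in the order of the operands.
    message = str(exc)

print(message)
```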
Since I am using ALOHA data whose state vector is 18-dimensional by design, the 14-element operand in the error must be the mask, so first look up _joint_flip_mask():
def _joint_flip_mask() -> np.ndarray:
    """Used to convert between aloha and pi joint angles."""
    return np.array([1, -1, -1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1])
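It is located in openpi-main/src/openpi/policies/aloha_policy.py
In fact, the dataset meta file already states that both state and action are 18-dimensional, and the feature names match the 18-joint ALOHA ordering exactly
The core problem is that _joint_flip_mask() returns a 14-dimensional mask: 14 = 7 + 7, which fits a dual-arm robot with 7 values per arm (six joints plus one gripper)
Therefore, _joint_flip_mask() needs to be rewritten as an 18-dimensional mask that follows the ALOHA layout
The info.json in the ALOHA dataset: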
"names": ["left_waist","left_shoulder","left_shoulder_shadow","left_elbow","left_elbow_shadow","left_forearm_roll","left_wrist_angle","left_wrist_rotate","left_gripper","right_waist","right_shoulder","right_shoulder_shadow","right_elbow","right_elbow_shadow","right_forearm_roll","right_wrist_angle","right_wrist_rotate","right_gripper"]
Therefore, _joint_flip_mask() is modified to:
def _joint_flip_mask():
    # Correct length and order, 18 dimensions
    return np.array([
        1,   # left_waist
        -1,  # left_shoulder
        1,   # left_shoulder_shadow
        -1,  # left_elbow
        1,   # left_elbow_shadow
        1,   # left_forearm_roll
        1,   # left_wrist_angle
        1,   # left_wrist_rotate
        1,   # left_gripper
        1,   # right_waist
        -1,  # right_shoulder
        1,   # right_shoulder_shadow
        -1,  # right_elbow
        1,   # right_elbow_shadow
        1,   # right_forearm_roll
        1,   # right_wrist_angle
        1,   # right_wrist_rotate
        1,   # right_gripper
    ])
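As an alternative to hand-writing the array, the mask can be derived from the joint names in info.json so that its length always follows the dataset. This is only a sketch: build_flip_mask and FLIPPED_JOINTS are hypothetical names that do not exist in openpi, and the set of flipped joints is simply read off the 18-dimensional mask above.

```python
import numpy as np

# Joint order copied from the "names" entry of the dataset's info.json.
STATE_NAMES = [
    "left_waist", "left_shoulder", "left_shoulder_shadow", "left_elbow",
    "left_elbow_shadow", "left_forearm_roll", "left_wrist_angle",
    "left_wrist_rotate", "left_gripper",
    "right_waist", "right_shoulder", "right_shoulder_shadow", "right_elbow",
    "right_elbow_shadow", "right_forearm_roll", "right_wrist_angle",
    "right_wrist_rotate", "right_gripper",
]

# Hypothetical helper data: joints that carry -1 in the mask above.
FLIPPED_JOINTS = {"shoulder", "elbow"}


def build_flip_mask(names, flipped=FLIPPED_JOINTS):
    """Return a +1/-1 array with one entry per joint name."""
    signs = []
    for name in names:
        # Drop the "left_" / "right_" prefix and compare the remainder exactly,
        # so "shoulder_shadow" is not treated as "shoulder".
        joint = name.split("_", 1)[1]
        signs.append(-1 if joint in flipped else 1)
    return np.array(signs)


mask = build_flip_mask(STATE_NAMES)
print(mask.shape, mask.tolist())
```

Because the mask is built from the same name list as the state, a dimension mismatch like the one in the traceback cannot occur as long as the names are kept in sync with the dataset.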
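The role of 1 and -1 in _joint_flip_mask:
- Physical meaning: a mask like this is typically used to unify joint direction conventions. On many robots (for example the dual-arm ALOHA), the same positive joint command can correspond to different physical directions, such as +1 degree moving outward on one arm and inward on the other. To make learning easier, a ±1 mask maps the sign of every joint angle into one consistent reference frame. For _joint_flip_mask() itself, the docstring gives the purpose as converting between ALOHA and pi joint angles, so the -1 entries mark the joints whose sign convention differs between the two.
- Implementation: each entry of the mask says whether the positive direction of the corresponding joint must be flipped. 1 means the sign of that joint already matches the target convention; -1 means the sign must be negated.
- Effect on the data: once the mask is applied, joint signs follow the same convention in every sample, so the model does not have to learn that a sign is reversed for some joints. Without this step the model would have to fit two mirrored versions of the same motion, which wastes capacity.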
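The points above can be sketched with plain NumPy. The mask values are the 18-dimensional ones from this section, while the state values are made-up numbers used purely for illustration. The sketch also shows that applying the mask twice restores the original vector, which is why a single ±1 array can serve for conversion in both directions:

```python
import numpy as np

# 18-dimensional mask from this section (shoulder and elbow are negated).
mask = np.array([1, -1, 1, -1, 1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1])

# Made-up joint readings; real values would come from the dataset.
state = np.linspace(-0.9, 0.8, 18)

converted = mask * state     # element-wise sign flip where mask == -1
restored = mask * converted  # multiplying by +1/-1 twice is the identity

print(converted[:4])
print(np.allclose(restored, state))
```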
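Running the command again succeeds: the data loader initializes and the download of the pi0_base checkpoint parameters begins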
(pi0) yejiangchen@yejiangchen:~/Desktop/Codes/openpi-main$ XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi0_aloha_cable_sort --exp-name=5_10 --overwrite
/home/yejiangchen/Desktop/Codes/openpi-main/.venv/lib/python3.11/site-packages/tyro/_parsers.py:332: UserWarning: The field `model.action-expert-variant` is annotated with type `typing.Literal['dummy', 'gemma_300m', 'gemma_2b', 'gemma_2b_lora']`, but the default value `gemma_300m_lora` has type `<class 'str'>`. We'll try to handle this gracefully, but it may cause unexpected behavior.
  warnings.warn(message)
10:49:23.927 [I] Running on: yejiangchen (14672:train.py:195)
INFO:2025-05-27 10:49:24,097:jax._src.xla_bridge:945: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
10:49:24.097 [I] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig' (14672:xla_bridge.py:945)
INFO:2025-05-27 10:49:24,098:jax._src.xla_bridge:945: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
10:49:24.098 [I] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory (14672:xla_bridge.py:945)
10:49:24.313 [I] Wiped checkpoint directory /home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10 (14672:checkpoints.py:25)
10:49:24.313 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (14672:base_pytree_checkpoint_handler.py:332)
10:49:24.313 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (14672:base_pytree_checkpoint_handler.py:332)
10:49:24.313 [I] [thread=MainThread] Failed to get flag value for EXPERIMENTAL_ORBAX_USE_DISTRIBUTED_PROCESS_ID. (14672:multihost.py:375)
10:49:24.313 [I] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'assets': <openpi.training.checkpoints.CallbackHandler object at 0x7b3d281cf910>, 'train_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b3d28141dd0>, 'params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b3d2839cf90>}, handler_registry=None (14672:checkpoint_manager.py:622)
10:49:24.313 [I] Deferred registration for item: "assets". Adding handler `<openpi.training.checkpoints.CallbackHandler object at 0x7b3d281cf910>` for item "assets" and save args `<class 'openpi.training.checkpoints.CallbackSave'>` and restore args `<class 'openpi.training.checkpoints.CallbackRestore'>` to `_handler_registry`. (14672:composite_checkpoint_handler.py:239)
10:49:24.313 [I] Deferred registration for item: "train_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b3d28141dd0>` for item "train_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. (14672:composite_checkpoint_handler.py:239)
10:49:24.313 [I] Deferred registration for item: "params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b3d2839cf90>` for item "params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. (14672:composite_checkpoint_handler.py:239)
10:49:24.313 [I] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b3d28143f10>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. (14672:composite_checkpoint_handler.py:239)
10:49:24.313 [I] Initialized registry DefaultCheckpointHandlerRegistry({('assets', <class 'openpi.training.checkpoints.CallbackSave'>): <openpi.training.checkpoints.CallbackHandler object at 0x7b3d281cf910>, ('assets', <class 'openpi.training.checkpoints.CallbackRestore'>): <openpi.training.checkpoints.CallbackHandler object at 0x7b3d281cf910>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b3d28141dd0>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b3d28141dd0>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b3d2839cf90>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b3d2839cf90>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b3d28143f10>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b3d28143f10>}). (14672:composite_checkpoint_handler.py:508)
10:49:24.313 [I] orbax-checkpoint version: 0.11.1 (14672:abstract_checkpointer.py:35)
10:49:24.313 [I] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x7b3d28026340> timeout: 7200 secs and primary_host=0 for async checkpoint writes (14672:async_checkpointer.py:80)
10:49:24.313 [I] Found 0 checkpoint steps in /home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10 (14672:checkpoint_manager.py:1528)
10:49:24.313 [I] Saving root metadata (14672:checkpoint_manager.py:1569)
10:49:24.313 [I] [process=0][thread=MainThread] Skipping global process sync, barrier name: CheckpointManager:save_metadata (14672:multihost.py:293)
10:49:24.313 [I] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=1, keep_time_interval=None, keep_period=5000, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=False, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=AsyncOptions(timeout_secs=7200, barrier_sync_fn=None, post_finalization_callback=None, create_directories_asynchronously=False), multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None), root_directory=/home/yejiangchen/Desktop/Codes/openpi-main/checkpoints/pi0_aloha_cable_sort/5_10: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7b3d2837f950> (14672:checkpoint_manager.py:797)
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: yejiangchen (yejiangchen-Nanjing University of Aeronautics and Astron). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in /home/yejiangchen/Desktop/Codes/openpi-main/wandb/run-20250527_104926-s3yfuee4
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 5_10
wandb: ⭐️ View project at https://wandb.ai/yejiangchen-Nanjing%20University%20of%20Aeronautics%20and%20Astron/openpi
wandb: 🚀 View run at https://wandb.ai/yejiangchen-Nanjing%20University%20of%20Aeronautics%20and%20Astron/openpi/runs/s3yfuee4
10:49:27.123 [I] Loaded norm stats from s3://openpi-assets/checkpoints/pi0_base/assets/trossen (14672:config.py:166)
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:00<00:00, 546598.13it/s]
10:49:35.856 [I] Initialized data loader:
[0].images['base_0_rgb']: (32, 224, 224, 3)@float32
[0].images['left_wrist_0_rgb']: (32, 224, 224, 3)@float32
[0].images['right_wrist_0_rgb']: (32, 224, 224, 3)@float32
[0].image_masks['base_0_rgb']: (32,)@bool
[0].image_masks['left_wrist_0_rgb']: (32,)@bool
[0].image_masks['right_wrist_0_rgb']: (32,)@bool
[0].state: (32, 32)@float32
[0].tokenized_prompt: (32, 48)@int32
[0].tokenized_prompt_mask: (32, 48)@bool
[1]: (32, 50, 32)@float32 (14672:train.py:227)
10:49:36.237 [I] Downloading s3://openpi-assets/checkpoints/pi0_base/params to /home/yejiangchen/.cache/openpi/openpi-assets/checkpoints/pi0_base/params (14672:download.py:93)
0%|▎ | 14.5M/11.2G [00:10<1:18:53, 2.53MiB/s]
0%|▎ | 14.8M/11.2G [00:10<2:17:04, 1.46MiB/s]
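A note on the data loader summary in the log: state is reported as (32, 32) even though the dataset state has 18 entries. One plausible reading, which is an assumption and not something the log confirms, is that the first 32 is the batch size and the second is a fixed model input width to which the state is zero-padded. The sketch below shows such padding with NumPy; pad_last_axis is a hypothetical helper written for this example, not the openpi implementation.

```python
import numpy as np


def pad_last_axis(x, target_dim):
    """Zero-pad the last axis of x up to target_dim entries."""
    current = x.shape[-1]
    if current > target_dim:
        raise ValueError(f"last axis has {current} entries, more than {target_dim}")
    # No padding on leading axes, zeros appended at the end of the last axis.
    pad_width = [(0, 0)] * (x.ndim - 1) + [(0, target_dim - current)]
    return np.pad(x, pad_width)


# A batch of 32 samples with an 18-dimensional state (values are placeholders).
state_batch = np.ones((32, 18), dtype=np.float32)
padded = pad_last_axis(state_batch, 32)

print(padded.shape, padded.dtype)
```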