
torch.distributed.launch, torchrun, and torch.distributed.run do not work with nohup

Symptom:

After starting PyTorch distributed training with nohup, the training job failed as soon as the SSH connection to the server dropped:

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3971878 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3971879 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3971841 got signal: 1

The command that was run:

nohup ./my_train.sh   >log.log 2>&1   &

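The likely cause is that torch.distributed.launch, torchrun, and torch.distributed.run are not compatible with nohup. nohup works by starting the command with SIGHUP (signal 1) ignored, but the traceback shows that torch.distributed.elastic handles the signal itself in _terminate_process_handler. When the SSH connection drops or the terminal window is closed, that handler receives the SIGHUP, shuts down the workers, and raises SignalException, so nohup has no effect.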
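The mechanism described above can be reproduced without PyTorch. The following is a minimal sketch, not the launcher's actual code: it starts a child Python process with SIGHUP already ignored (the state nohup leaves behind), and the child then installs its own SIGHUP handler, as a launcher might. The names CHILD and run_child_with_sighup_ignored are hypothetical and exist only for this illustration. It assumes a POSIX system.

```python
import signal
import subprocess
import sys
import textwrap

# Hypothetical stand-in for a launcher that installs its own SIGHUP handler.
CHILD = textwrap.dedent("""
    import os, signal, sys

    # Started the way nohup starts a command: SIGHUP is inherited as "ignore".
    inherited_ignored = signal.getsignal(signal.SIGHUP) == signal.SIG_IGN

    def handler(signum, frame):
        print(f"inherited_ignored={inherited_ignored} got_signal={signum}", flush=True)
        sys.exit(0)

    # Installing a handler replaces the inherited "ignore" disposition.
    signal.signal(signal.SIGHUP, handler)

    # Simulate the hangup that arrives when the SSH session closes.
    os.kill(os.getpid(), signal.SIGHUP)
    print("survived", flush=True)
""")


def run_child_with_sighup_ignored():
    """Run CHILD with SIGHUP ignored before exec, and return its stdout."""

    def ignore_sighup():
        # Runs in the child between fork and exec; SIG_IGN survives exec.
        signal.signal(signal.SIGHUP, signal.SIG_IGN)

    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        preexec_fn=ignore_sighup,
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(run_child_with_sighup_ignored())
```

The child reports that it inherited an ignored SIGHUP and that its own handler ran anyway, and it never reaches the "survived" line. A process that registers a SIGHUP handler is therefore not protected by nohup. A common way to avoid the hangup altogether is to start the job inside a terminal multiplexer such as tmux or screen, so that the terminal the job is attached to stays open after the SSH connection closes.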

ref: https://discuss.pytorch.org/t/ddp-error-torch-distributed-elastic-agent-server-api-received-1-death-signal-shutting-down-workers/135720/6

