Ray Cluster Deployment and Maintenance
1. Environment Setup
1.1 Install Dependencies
Run the command for your cloud platform to install the required dependencies:
AWS
pip install -U "ray[default]" boto3
GCP
pip install -U "ray[default]" google-api-python-client
Azure
pip install -U "ray[default]" azure-cli azure-core
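After installation, a quick import check can confirm that the packages are visible to the interpreter that will run Ray. The following is a minimal sketch and not part of Ray: the helper name `missing_modules` and the mapping from platform to import names (`boto3`, `googleapiclient`, `azure.core`) are assumptions made for illustration.

```python
import importlib.util

# Assumed import names for the packages installed above (illustrative only).
REQUIRED_MODULES = {
    "aws": ["ray", "boto3"],
    "gcp": ["ray", "googleapiclient"],
    "azure": ["ray", "azure.core"],
}


def missing_modules(names):
    """Return the names in `names` that the import system cannot locate."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (for example "azure") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for platform, modules in REQUIRED_MODULES.items():
        print(platform, "missing:", missing_modules(modules))
```

An empty list for your platform suggests the install step succeeded; it does not verify package versions.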
1.2 Configure Cloud Credentials
AWS
Configure the ~/.aws/credentials file as described in the AWS documentation.
GCP
Set the environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
Azure
Log in and select the subscription:
az login
az account set -s <subscription_id>
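Before launching a cluster, it can help to confirm that the credential source for the chosen platform is actually present. The sketch below is a hypothetical helper (`credentials_hint` is not a Ray or cloud SDK function); it only checks the locations named in this section and does not validate the credentials themselves.

```python
import os
import shutil
from pathlib import Path


def credentials_hint(provider, env=None, home=None):
    """Return None if the expected credential source exists, else a hint string."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    if provider == "aws":
        path = home / ".aws" / "credentials"
        return None if path.is_file() else f"missing {path}"

    if provider == "gcp":
        key = env.get("GOOGLE_APPLICATION_CREDENTIALS")
        if not key:
            return "GOOGLE_APPLICATION_CREDENTIALS is not set"
        return None if Path(key).is_file() else f"key file not found: {key}"

    if provider == "azure":
        # Only checks that the az CLI is installed, not that a login is active.
        return None if shutil.which("az") else "az CLI not found on PATH"

    raise ValueError(f"unknown provider: {provider}")
```

For example, `credentials_hint("gcp")` returns a hint string when the environment variable is unset or points to a file that does not exist.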
2. Cluster Deployment
2.1 Create the Configuration File
Create a config.yaml file. Minimal configuration examples for each platform follow:
AWS
cluster_name: minimal
provider:
    type: aws
    region: us-west-2
auth:
    ssh_user: ubuntu
GCP
cluster_name: minimal
provider:
    type: gcp
    region: us-west1
auth:
    ssh_user: ubuntu
Azure
cluster_name: minimal
provider:
    type: azure
    location: westus2
    resource_group: ray-cluster
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id_rsa
    ssh_public_key: ~/.ssh/id_rsa.pub
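Because indentation mistakes in config.yaml are easy to make, a small pre-flight check on the parsed configuration can catch missing keys early. This is a minimal sketch that assumes the file has already been loaded into a Python dict (for example with a YAML parser); `check_minimal_config` is a hypothetical helper that covers only the keys shown in the examples above and is not a substitute for the validation that `ray up` performs.

```python
SUPPORTED_PROVIDERS = {"aws", "gcp", "azure"}


def check_minimal_config(config):
    """Return a list of problems found in a parsed config.yaml mapping."""
    problems = []
    if not config.get("cluster_name"):
        problems.append("cluster_name is missing")

    provider = config.get("provider") or {}
    if provider.get("type") not in SUPPORTED_PROVIDERS:
        problems.append("provider.type must be one of aws, gcp, azure")

    # The AWS and GCP examples use "region"; the Azure example uses "location".
    location_key = "location" if provider.get("type") == "azure" else "region"
    if not provider.get(location_key):
        problems.append(f"provider.{location_key} is missing")

    if not (config.get("auth") or {}).get("ssh_user"):
        problems.append("auth.ssh_user is missing")
    return problems


# The Azure example from this section, written as a dict.
azure_example = {
    "cluster_name": "minimal",
    "provider": {"type": "azure", "location": "westus2", "resource_group": "ray-cluster"},
    "auth": {
        "ssh_user": "ubuntu",
        "ssh_private_key": "~/.ssh/id_rsa",
        "ssh_public_key": "~/.ssh/id_rsa.pub",
    },
}
print(check_minimal_config(azure_example))
```

An empty list means the keys used in this section's examples are all present.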
2.2 Start the Cluster
ray up -y config.yaml
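When the launch step is driven from an automation script rather than typed by hand, the same command can be wrapped in Python. The sketch below is one possible wrapper; the function names `ray_up_command` and `launch_cluster` are hypothetical, and it simply shells out to the `ray up` command shown above.

```python
import shutil
import subprocess


def ray_up_command(config_path, assume_yes=True):
    """Build the argument list for `ray up`, optionally with -y."""
    cmd = ["ray", "up"]
    if assume_yes:
        cmd.append("-y")
    cmd.append(config_path)
    return cmd


def launch_cluster(config_path):
    """Run `ray up -y <config_path>` and raise if it fails."""
    if shutil.which("ray") is None:
        raise RuntimeError("the ray CLI was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(ray_up_command(config_path), check=True)
```

Passing the arguments as a list avoids shell quoting issues in the config path.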
3. Using the Cluster
3.1 Submit a Job
ray exec config.yaml 'python -c "import ray; ray.init()"'
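The command above nests double quotes inside single quotes, which is easy to get wrong when the remote command is built dynamically. As a minimal sketch, Python's standard `shlex` module can produce a correctly quoted command line; `ray_exec_command` is a hypothetical helper name used only for this illustration.

```python
import shlex


def ray_exec_command(config_path, remote_command):
    """Build the shell line for `ray exec`, quoting the remote command as one argument."""
    return shlex.join(["ray", "exec", config_path, remote_command])


line = ray_exec_command("config.yaml", 'python -c "import ray; ray.init()"')
print(line)
```

The printed line matches the command shown in this section and can be pasted into a shell.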
3.2 Connect to the Cluster
ray attach config.yaml
3.3 Run a Sample Application
Create a script.py file:
from collections import Counter
import socket
import time

import ray

ray.init()

print(f'''This cluster consists of
    {len(ray.nodes())} nodes in total
    {ray.cluster_resources()['CPU']} CPU resources in total
''')

@ray.remote
def