
Unit 3: Training a Q-Learning Agent on Frozen-Lake-v1

The Frozen-Lake-v1 Environment

The Frozen Lake environment involves crossing a frozen lake from a start position to a goal position without falling into any holes, by walking over the frozen surface. Because the ice is slippery, the player may not always move in the intended direction. The game starts with the player at location [0, 0] of the frozen lake grid world, with the goal located at the far end of the world, e.g. [3, 3] in the 4x4 environment. Holes in the ice are distributed in fixed locations when a pre-determined map is used, or in random locations when a random map is generated. The player keeps moving until they reach the goal or fall into a hole. Since the lake is slippery (unless this feature is disabled), the player may sometimes move perpendicular to the intended direction (see the is_slippery argument). A randomly generated world always contains a path to the goal.

Action Space

The action is of shape (1,), taking an integer value in {0, 1, 2, 3} that indicates the direction the player moves:

  • 0: move left
  • 1: move down
  • 2: move right
  • 3: move up
Observation Space

The observation is a value representing the player's current position, computed as current_row * ncols + current_col (with rows and columns both indexed from 0). For example, the goal position in the 4x4 map can be calculated as 3 * 4 + 3 = 15. The number of possible observations depends on the size of the map. The observation is returned as an int().
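To make the indexing concrete, here is a small standalone sketch (not part of the tutorial code) that converts between (row, col) positions and observation indices on a 4x4 map:

ncols = 4  # 4x4 map

def to_state(row, col, ncols=ncols):
    # Flatten a (row, col) grid position into the Discrete observation index
    return row * ncols + col

def to_position(state, ncols=ncols):
    # Recover (row, col) from the observation index
    return divmod(state, ncols)

print(to_state(3, 3))   # 15, the goal state of the 4x4 map
print(to_position(15))  # (3, 3)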

Starting State

At the start of each episode, the player is in state [0] (location [0, 0]).

Rewards

The reward schedule is:

  • Reach the goal: +1
  • Fall into a hole: 0
  • Stay on the frozen surface (neither goal nor hole): 0
Episode End

The episode ends when any of the following occurs:

  • Termination:
    • The player falls into a hole.
    • The player reaches the goal at max(nrow) * max(ncol) - 1 (location [max(nrow)-1, max(ncol)-1]).
  • Truncation (when using the time_limit wrapper):
    • For the 4x4 environment, the episode length limit is 100.
    • For the FrozenLake8x8-v1 environment, the episode length limit is 200.
Information

step() and reset() return a dictionary with the following keys:

  • p - the transition probability for the state.
  • See the description of the is_slippery argument for details on transition probabilities.
Arguments
import gymnasium as gym
gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True)
The desc argument specifies a custom map layout as a list of strings built from the following tiles:

  • "S" for the Start tile
  • "G" for the Goal tile
  • "F" for a frozen tile
  • "H" for a tile with a hole
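For example, a custom 4x4 layout can be passed directly via desc (a sketch; the layout below is just an illustration):

import gymnasium as gym

custom_map = ["SFFF",
              "FHFH",
              "FFFH",
              "HFFG"]
env = gym.make("FrozenLake-v1", desc=custom_map, is_slippery=True)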

Random map generation:

from gymnasium.envs.toy_text.frozen_lake import generate_random_map
gym.make('FrozenLake-v1', desc=generate_random_map(size=8))

is_slippery=True: if True, the player moves in the intended direction with probability 1/3; otherwise it moves in one of the two perpendicular directions, each with probability 1/3.
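To see these probabilities in practice, the underlying toy-text implementation exposes a transition table P through env.unwrapped; this attribute is an implementation detail rather than part of the public Gymnasium API, so treat the snippet below as a sketch:

import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)

# P[state][action] is a list of (probability, next_state, reward, terminated) tuples.
# With is_slippery=True, the intended action and the two perpendicular ones
# each appear with probability 1/3.
for prob, next_state, reward, terminated in env.unwrapped.P[0][2]:  # state 0, action 2 ("right")
    print(f"p={prob:.2f} -> state {next_state}, reward {reward}, terminated {terminated}")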

Frozen Lake ⛄ (non slippery version)

We train a Q-Learning agent to navigate from the start state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H).

The environment comes in two sizes:

  • map_name="4x4": a 4x4 grid version
  • map_name="8x8": an 8x8 grid version

The environment has two modes:

  • is_slippery=False: the agent always moves in the intended direction because the frozen lake is not slippery (deterministic environment).
  • is_slippery=True: the agent may not always move in the intended direction because the frozen lake is slippery (stochastic environment).

We start with the non-slippery mode (note that the code below uses the 8x8 map rather than the simpler 4x4 one). We also pass the argument render_mode="rgb_array" to specify how the environment should be visualized: "rgb_array" returns a single frame representing the current state of the environment, as an np.ndarray of shape (x, y, 3) holding the RGB values of an x-by-y pixel image.
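A minimal sketch of grabbing a single frame this way (the exact pixel dimensions depend on the map size and Gymnasium version):

import gymnasium as gym

demo_env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="rgb_array")
demo_env.reset()
frame = demo_env.render()  # np.ndarray of shape (x, y, 3)
print(type(frame), frame.shape)
demo_env.close()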

import os
import random   # To generate random numbers
import imageio  # To generate a replay video
import numpy as np
import gymnasium as gym
import pickle5 as pickle        # Save/Load model
from tqdm.notebook import tqdm  # Progress bars
# Create the environment
env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=False, render_mode="rgb_array")
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample())  # Get a random observation
_____OBSERVATION SPACE_____

Observation Space Discrete(64)
Sample observation 35
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample())  # Take a random action
_____ACTION SPACE_____

Action Space Shape 4
Action Space Sample 1
Creating and initializing the Q-Table
state_space = env.observation_space.n
action_space = env.action_space.n

def initialize_q_table(state_space, action_space):
    Qtable = np.zeros((state_space, action_space))
    return Qtable

Qtable_frozenlake = initialize_q_table(state_space, action_space)
print("Q-Table :\n", Qtable_frozenlake)
Q-Table :
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 ...
 [0. 0. 0. 0.]]  (a 64 x 4 array of zeros, output truncated)
Defining the greedy policy (updating policy)

Once the Q-Learning agent has finished training, the policy we ultimately use is the greedy policy. The greedy policy exploits the Q-table to select actions.

def greedy_policy(Qtable, state):
    # Exploitation: take the action with the highest state, action value
    action = np.argmax(Qtable[state][:])
    return action
Defining the ε-greedy policy (acting policy)

The ε-greedy (epsilon-greedy) policy is a training policy that balances the exploration/exploitation trade-off during training.
The idea of the ε-greedy policy is:

  • With probability 1 - ε: we exploit (the agent selects the action with the highest state-action value).
  • With probability ε: we explore (try a random action).

As training proceeds, we progressively reduce ε, because we need less and less exploration and more and more exploitation.

def epsilon_greedy_policy(Qtable, state, epsilon):
    # Randomly generate a number between 0 and 1
    random_num = random.uniform(0, 1)
    # if random_num > epsilon --> exploitation
    if random_num > epsilon:
        # Take the action with the highest value given a state
        action = greedy_policy(Qtable, state)
    # else --> exploration
    else:
        action = env.action_space.sample()
    return action
Defining the hyperparameters

The exploration-related hyperparameters are among the most important. We need to make sure the agent explores the state space enough to learn a good value approximation, so we decay ε gradually. If ε decays too quickly (i.e. the decay rate is too high), the agent can get stuck: it will not have explored the state space enough to solve the problem.

# Training parameters
n_training_episodes = 10000  # Total training episodes
learning_rate = 0.7          # Learning rate

# Evaluation parameters
n_eval_episodes = 100        # Total number of test episodes

# Environment parameters
env_id = "FrozenLake-v1"     # Name of the environment
max_steps = 200              # Max steps per episode
gamma = 0.95                 # Discounting rate
eval_seed = []               # The evaluation seed of the environment

# Exploration parameters
max_epsilon = 1.0            # Exploration probability at start
min_epsilon = 0.05           # Minimum exploration probability
decay_rate = 0.0001          # Exponential decay rate for exploration prob
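As a quick sanity check on how fast the exponential schedule controlled by decay_rate would shrink ε, we can evaluate it at a few episodes (a standalone snippet using the hyperparameters defined above, not part of the training loop):

for episode in (0, 1000, 5000, 10000):
    # Exponential epsilon decay: starts at max_epsilon, approaches min_epsilon
    eps = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    print(f"episode {episode:>5d}: epsilon = {eps:.3f}")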
Training the agent

Cosine annealing is a schedule commonly used in deep learning and reinforcement learning for the learning rate or the exploration rate. It decays slowly at first, faster in the middle, and slowly again toward the end, which makes it well suited to decaying the exploration rate. Based on the cosine function, the schedule can be written as:
$\epsilon_t = \epsilon_{min} + \frac{1}{2}(\epsilon_{max} - \epsilon_{min})\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)$

where t is the current episode and T is the total number of training episodes (n_training_episodes in the code below).
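Plugging a few values of t into this formula (with T = n_training_episodes and the ε bounds defined above) shows the slow-fast-slow shape of the decay:

T = n_training_episodes
for t in (0, T // 4, T // 2, 3 * T // 4, T):
    # Cosine-annealed epsilon between max_epsilon (t=0) and min_epsilon (t=T)
    eps = min_epsilon + 0.5 * (max_epsilon - min_epsilon) * (1 + np.cos(t * np.pi / T))
    print(f"t = {t:>5d}: epsilon = {eps:.3f}")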

def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
    for episode in tqdm(range(n_training_episodes)):
        # Reduce epsilon (because we need less and less exploration)
        # epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * 0.5 * (1 + np.cos(episode * np.pi / n_training_episodes))

        # Reset the environment
        state, info = env.reset()
        step = 0
        terminated = False
        truncated = False

        # repeat
        for step in range(max_steps):
            # Choose the action At using the epsilon-greedy policy
            action = epsilon_greedy_policy(Qtable, state, epsilon)

            # Take action At and observe Rt+1 and St+1
            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state, reward, terminated, truncated, info = env.step(action)

            # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
            Qtable[state][action] = Qtable[state][action] + learning_rate * (
                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
            )

            # If terminated or truncated, finish the episode
            if terminated or truncated:
                break

            # Our next state is the new state
            state = new_state
    return Qtable
Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)
print("Q table:\n",Qtable_frozenlake)
Q table:
 [[0.48767498 0.51334208 0.51334208 0.48767498]
 [0.48767498 0.54036009 0.54036009 0.51334208]
 [0.51334208 0.56880009 0.56880009 0.54036009]
 ...
 [0.         0.52391649 0.         0.69833488]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]]  (output truncated)
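The imports above include pickle for saving and loading the model; a minimal sketch of persisting the learned Q-table (the file name is arbitrary):

# Save the learned Q-table
with open("qtable_frozenlake.pkl", "wb") as f:
    pickle.dump(Qtable_frozenlake, f)

# Load it back later
with open("qtable_frozenlake.pkl", "rb") as f:
    Qtable_frozenlake = pickle.load(f)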
Evaluating the agent
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
    """
    Evaluate the agent for ``n_eval_episodes`` episodes and return the average reward and std of reward.
    :param env: The evaluation environment
    :param n_eval_episodes: Number of episodes to evaluate the agent
    :param Q: The Q-table
    :param seed: The evaluation seed array (for taxi-v3)
    """
    episode_rewards = []
    for episode in tqdm(range(n_eval_episodes)):
        if seed:
            state, info = env.reset(seed=seed[episode])
        else:
            state, info = env.reset()
        step = 0
        truncated = False
        terminated = False
        total_rewards_ep = 0

        for step in range(max_steps):
            # Take the action (index) that has the maximum expected future reward given that state
            action = greedy_policy(Q, state)
            new_state, reward, terminated, truncated, info = env.step(action)
            total_rewards_ep += reward

            if terminated or truncated:
                break
            state = new_state
        episode_rewards.append(total_rewards_ep)

    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)
    return mean_reward, std_reward
# Evaluate our Agent
mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
Mean_reward=1.00 +/- 0.00

Visualizing the Q-Learning agent:

env = gym.wrappers.RecordVideo(env, video_folder="./FrozenLake-v1-QL", disable_logger=True, fps=30)
state, info = env.reset()
for step in range(max_steps):
    action = greedy_policy(Qtable_frozenlake, state)
    state, reward, terminated, truncated, info = env.step(action)
    if terminated:
        break
env.close()
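Alternatively, since imageio is already imported, the frames returned by env.render() can be stitched into a video manually. A hedged sketch (the output path is arbitrary, and writing .mp4 may require the imageio-ffmpeg plugin):

# Assumes a fresh environment with render_mode="rgb_array"
replay_env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=False, render_mode="rgb_array")
frames = []
state, info = replay_env.reset()
frames.append(replay_env.render())
for step in range(max_steps):
    action = greedy_policy(Qtable_frozenlake, state)
    state, reward, terminated, truncated, info = replay_env.step(action)
    frames.append(replay_env.render())
    if terminated or truncated:
        break
replay_env.close()
imageio.mimsave("./replay.mp4", frames, fps=5)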


Frozen Lake ⛄ (slippery version)

# Create the environment
slippery_env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True, render_mode="rgb_array")
# Initialize the Q-Table
SQtable_frozenlake = initialize_q_table(slippery_env.observation_space.n, slippery_env.action_space.n)
print("Q-Table :\n", SQtable_frozenlake)
Q-Table :
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 ...
 [0. 0. 0. 0.]]  (a 64 x 4 array of zeros, output truncated)
# Training parameters
n_training_episodes = 30000  # Total training episodes
SQtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, slippery_env, max_steps, SQtable_frozenlake)
print("Slippery Q table:\n",SQtable_frozenlake)
Slippery Q table:
 [[2.95092538e-02 2.98058559e-02 3.55425091e-02 2.92342470e-02]
 [3.42323951e-02 4.00210946e-02 2.16868036e-02 2.13141622e-02]
 [3.14652001e-02 4.41281793e-02 3.17972584e-02 6.43655693e-02]
 ...
 [4.13766214e-01 1.66513079e-01 6.39326884e-01 4.46513193e-01]
 [5.20579883e-01 9.53990423e-01 8.54776446e-02 1.89000000e-02]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]  (output truncated)
# Evaluate our Agent
mean_reward, std_reward = evaluate_agent(slippery_env, max_steps, n_eval_episodes, SQtable_frozenlake, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
Mean_reward=0.37 +/- 0.48

Visualizing the slippery agent:

slippery_env = gym.wrappers.RecordVideo(slippery_env, video_folder="./Slippery-FrozenLake-v1-QL", disable_logger=True, fps=30)
state, info = slippery_env.reset()
for step in range(max_steps):
    action = greedy_policy(SQtable_frozenlake, state)
    state, reward, terminated, truncated, info = slippery_env.step(action)
    if terminated:
        break
slippery_env.close()

Agent vs Slippery Agent

The non-slippery agent reaches the goal in every evaluation episode (mean reward 1.00 +/- 0.00), while the slippery agent, trained with the same setup, succeeds in only about 37% of episodes (mean reward 0.37 +/- 0.48), reflecting the stochastic transitions of the slippery lake.
