Unit 3: Training a Q-Learning Agent on FrozenLake-v1
FrozenLake-v1 Environment Description
The Frozen Lake environment involves crossing a frozen lake from the start to the goal without falling into any holes, by walking over the frozen surface. The game starts with the player at location [0,0] of the frozen lake grid world, with the goal located at the far end of the world, e.g. [3,3] for the 4x4 environment. Holes in the ice are distributed in fixed locations when using a pre-determined map, or in random locations when a random map is generated. The player keeps moving until they reach the goal or fall into a hole. Because the lake is slippery (unless that feature is disabled), the player may not always move in the intended direction and may sometimes move perpendicular to it (see the is_slippery argument). Randomly generated worlds always have a path to the goal.
Action Space
The action has shape (1,) and takes integer values in {0, 1, 2, 3}, indicating which direction to move the player:
- 0: Move left
- 1: Move down
- 2: Move right
- 3: Move up
Observation Space
The observation is a value representing the player's current position, computed as current_row * ncols + current_col (where both the row and column start at 0). For example, the goal position in the 4x4 map can be calculated as 3 * 4 + 3 = 15. The number of possible observations depends on the size of the map. The observation is returned as an int().
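A quick sketch of this index calculation (ncols = 4 matches the 4x4 map; the variable names are only illustrative):
ncols = 4
current_row, current_col = 3, 3   # the goal tile of the 4x4 map
observation = current_row * ncols + current_col
print(observation)  # 15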
Starting State
At the start of each episode, the player is in state [0] (location [0, 0]).
Rewards
The reward schedule is:
- Reach goal: +1
- Reach hole: 0
- Reach frozen tile (neither the goal nor a hole): 0
Episode End
The episode ends when one of the following happens:
- Termination:
  - The player falls into a hole.
  - The player reaches the goal at max(nrow) * max(ncol) - 1 (location [max(nrow)-1, max(ncol)-1]).
- Truncation (when using the time_limit wrapper):
  - The episode length is 100 for the 4x4 environment.
  - The episode length is 200 for the FrozenLake8x8-v1 environment.
Information
step() and reset() return a dict with the following keys:
- p - transition probability for the state.
- See the description of the is_slippery argument for details on the transition probabilities.
Arguments
import gymnasium as gym
gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True)
desc: specifies a custom map as a list of strings built from the following tile characters (a short sketch follows this list):
- "S": start tile
- "G": goal tile
- "F": frozen tile
- "H": a tile with a hole
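For example, a custom map can be passed via desc (a minimal sketch; the 4x4 layout below happens to be the standard default map, and the name custom_map is just illustrative):
import gymnasium as gym

# Custom map layout: S = start, F = frozen, H = hole, G = goal
custom_map = ["SFFF",
              "FHFH",
              "FFFH",
              "HFFG"]
env = gym.make("FrozenLake-v1", desc=custom_map, is_slippery=True)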
Random map generation:
from gymnasium.envs.toy_text.frozen_lake import generate_random_map
gym.make('FrozenLake-v1', desc=generate_random_map(size=8))
is_slippery=True: if True, the player will move in the intended direction with probability 1/3; otherwise, it will move in one of the two perpendicular directions, each with probability 1/3.
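To see these probabilities concretely, the toy-text implementation stores a transition table P on the underlying environment (an internal attribute rather than a documented API, so treat the snippet below as an illustrative sketch): env.unwrapped.P[state][action] is a list of (probability, next_state, reward, terminated) tuples.
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
env.reset()
# Transitions for state 0 when taking action 2 (move right):
# with is_slippery=True there are three outcomes, each with probability 1/3.
for prob, next_state, reward, terminated in env.unwrapped.P[0][2]:
    print(f"p={prob:.2f} next_state={next_state} reward={reward} terminated={terminated}")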
Frozen Lake ⛄ (non slippery version)
Train a Q-Learning agent to navigate from the start state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H).
The environment comes in two sizes:
- map_name="4x4": a 4x4 grid version
- map_name="8x8": an 8x8 grid version
The environment has two modes:
- is_slippery=False: the agent always moves in the intended direction because the frozen lake is not slippery (deterministic environment).
- is_slippery=True: the agent may not always move in the intended direction because the frozen lake is slippery (stochastic environment).
We start with the non-slippery mode (the code below uses the 8x8 map). We also pass a render_mode="rgb_array" argument, which specifies how the environment should be visualized: "rgb_array" returns a single frame representing the current state of the environment, as an np.ndarray of shape (x, y, 3) containing the RGB values of an x-by-y pixel image.
import os
import random # To generate random numbers
import imageio # To generate a replay video
import numpy as np
import gymnasium as gym
import pickle5 as pickle # To save/load the model (Q-table)
from tqdm.notebook import tqdm # Progress bars
# Create the environment
env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=False, render_mode="rgb_array")
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample()) # Get a random observation
_____OBSERVATION SPACE_____
Observation Space Discrete(64)
Sample observation 35
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action
_____ACTION SPACE_____
Action Space Shape 4
Action Space Sample 1
Create and Initialize the Q-Table
state_space = env.observation_space.n
action_space = env.action_space.n

def initialize_q_table(state_space, action_space):
    Qtable = np.zeros((state_space, action_space))
    return Qtable

Qtable_frozenlake = initialize_q_table(state_space, action_space)
print("Q-Table :\n", Qtable_frozenlake)
Q-Table :
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 ...
 [0. 0. 0. 0.]]
(64 states x 4 actions, all zeros; output truncated)
Define the Greedy Policy (updating policy)
Once the Q-learning agent has been trained, the final policy we use is also the greedy policy. The greedy policy selects actions by exploiting the Q-table.
def greedy_policy(Qtable, state):
    # Exploitation: take the action with the highest state, action value
    action = np.argmax(Qtable[state][:])
    return action
Define the ε-Greedy Policy (acting policy)
Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off during training.
The idea of epsilon-greedy is:
- With probability 1 - ε: we exploit (i.e., the agent selects the action with the highest state-action value).
- With probability ε: we explore (try a random action).
As training goes on, we progressively reduce the value of ε, since we need less and less exploration and more and more exploitation.
def epsilon_greedy_policy(Qtable, state, epsilon):
    # Randomly generate a number between 0 and 1
    random_num = random.uniform(0, 1)
    # If random_num > epsilon --> exploitation
    if random_num > epsilon:
        # Take the action with the highest value given a state
        action = greedy_policy(Qtable, state)
    # else --> exploration
    else:
        action = env.action_space.sample()
    return action
Define the Hyperparameters
The exploration-related hyperparameters are among the most important ones. We need to make sure the agent explores enough of the state space to learn a good value approximation, so we progressively decay ε. If ε decays too quickly (i.e., the decay rate is too high), the agent can get stuck: it will not have explored enough of the state space to solve the problem.
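A common schedule (it appears, commented out, in the training loop below) is exponential decay of the exploration rate with the episode index t:
$$\epsilon_t = \epsilon_{min} + (\epsilon_{max} - \epsilon_{min})\, e^{-\text{decay\_rate} \cdot t}$$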
# Training parameters
n_training_episodes = 10000  # Total training episodes
learning_rate = 0.7          # Learning rate

# Evaluation parameters
n_eval_episodes = 100        # Total number of test episodes

# Environment parameters
env_id = "FrozenLake-v1"     # Name of the environment
max_steps = 200              # Max steps per episode
gamma = 0.95                 # Discounting rate
eval_seed = []               # The evaluation seed of the environment

# Exploration parameters
max_epsilon = 1.0            # Exploration probability at start
min_epsilon = 0.05           # Minimum exploration probability
decay_rate = 0.0001          # Exponential decay rate for exploration prob
Train the Agent
Cosine annealing is a schedule commonly used in deep learning and reinforcement learning to adjust the learning rate or the exploration rate. It decays slowly at first, faster in the middle, and slowly again at the end, which suits exploration-rate decay well. Note that with the cosine schedule used below, decay_rate is not actually used; it only applies to the commented-out exponential schedule. The cosine annealing schedule is based on the cosine function and can be written as:
$$\epsilon_t = \epsilon_{min} + \frac{1}{2}(\epsilon_{max} - \epsilon_{min})\left(1 + \cos\frac{t\pi}{T}\right)$$
where t is the current episode and T is the total number of training episodes.
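Inside the training loop, each transition is used to apply the standard Q-learning update (the same formula that appears as a comment in the code below), with learning rate α (learning_rate) and discount factor γ (gamma):
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$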
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
    for episode in tqdm(range(n_training_episodes)):
        # Reduce epsilon (because we need less and less exploration)
        # epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * 0.5 * (1 + np.cos(episode * np.pi / n_training_episodes))
        # Reset the environment
        state, info = env.reset()
        step = 0
        terminated = False
        truncated = False

        # repeat
        for step in range(max_steps):
            # Choose the action At using epsilon greedy policy
            action = epsilon_greedy_policy(Qtable, state, epsilon)

            # Take action At and observe Rt+1 and St+1
            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state, reward, terminated, truncated, info = env.step(action)

            # Update Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
            Qtable[state][action] = Qtable[state][action] + learning_rate * (
                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
            )

            # If terminated or truncated, finish the episode
            if terminated or truncated:
                break

            # Our next state is the new state
            state = new_state
    return Qtable
Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)
print("Q table:\n",Qtable_frozenlake)
Q table:
 [[0.48767498 0.51334208 0.51334208 0.48767498]
 [0.48767498 0.54036009 0.54036009 0.51334208]
 [0.51334208 0.56880009 0.56880009 0.54036009]
 ...
 [0.         0.52391649 0.         0.69833488]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]]
(64 states x 4 actions; output truncated)
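The pickle5 import above is there to save/load the model. A minimal sketch of persisting the learned Q-table (the file name frozenlake_qtable.pkl and the name Qtable_loaded are arbitrary examples):
# Save the learned Q-table to disk
with open("frozenlake_qtable.pkl", "wb") as f:
    pickle.dump(Qtable_frozenlake, f)

# Load it back later for evaluation or inference
with open("frozenlake_qtable.pkl", "rb") as f:
    Qtable_loaded = pickle.load(f)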
Evaluate the Agent
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
    """
    Evaluate the agent for ``n_eval_episodes`` episodes and return the average reward and std of reward.

    :param env: The evaluation environment
    :param n_eval_episodes: Number of episodes to evaluate the agent
    :param Q: The Q-table
    :param seed: The evaluation seed array (for taxi-v3)
    """
    episode_rewards = []
    for episode in tqdm(range(n_eval_episodes)):
        if seed:
            state, info = env.reset(seed=seed[episode])
        else:
            state, info = env.reset()
        step = 0
        truncated = False
        terminated = False
        total_rewards_ep = 0

        for step in range(max_steps):
            # Take the action (index) that has the maximum expected future reward given that state
            action = greedy_policy(Q, state)
            new_state, reward, terminated, truncated, info = env.step(action)
            total_rewards_ep += reward

            if terminated or truncated:
                break
            state = new_state
        episode_rewards.append(total_rewards_ep)
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    return mean_reward, std_reward
# Evaluate our Agent
mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
Mean_reward=1.00 +/- 0.00
Visualize the Q-Learning Agent:
env = gym.wrappers.RecordVideo(env, video_folder="./FrozenLake-v1-QL",disable_logger=True,fps=30)
state, info = env.reset()
for step in range(max_steps):
    action = greedy_policy(Qtable_frozenlake, state)
    state, reward, terminated, truncated, info = env.step(action)
    if terminated:
        break
env.close()
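imageio was imported earlier to generate a replay video; as an alternative to the RecordVideo wrapper, a minimal sketch that collects rendered frames and writes them with imageio could look like this (replay_env and replay.mp4 are arbitrary illustrative names; writing .mp4 requires imageio's ffmpeg plugin):
# Record one greedy episode as a video using imageio instead of RecordVideo
replay_env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=False, render_mode="rgb_array")
images = []
state, info = replay_env.reset()
images.append(replay_env.render())
for step in range(max_steps):
    action = greedy_policy(Qtable_frozenlake, state)
    state, reward, terminated, truncated, info = replay_env.step(action)
    images.append(replay_env.render())
    if terminated or truncated:
        break
replay_env.close()
imageio.mimsave("./replay.mp4", [np.array(img) for img in images], fps=2)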
Frozen Lake ⛄ (slippery version)
# Create the environment
slippery_env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True, render_mode="rgb_array")
# Initialize the Q-table
SQtable_frozenlake = initialize_q_table(slippery_env.observation_space.n, slippery_env.action_space.n)
print("Q-Table :\n", SQtable_frozenlake)
Q-Table :
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 ...
 [0. 0. 0. 0.]]
(64 states x 4 actions, all zeros; output truncated)
# Training parameters
n_training_episodes = 30000 # Total training episodes
SQtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, slippery_env, max_steps, SQtable_frozenlake)
print("Slippery Q table:\n",SQtable_frozenlake)
Slippery Q table:
 [[2.95092538e-02 2.98058559e-02 3.55425091e-02 2.92342470e-02]
 [3.42323951e-02 4.00210946e-02 2.16868036e-02 2.13141622e-02]
 [3.14652001e-02 4.41281793e-02 3.17972584e-02 6.43655693e-02]
 ...
 [4.13766214e-01 1.66513079e-01 6.39326884e-01 4.46513193e-01]
 [5.20579883e-01 9.53990423e-01 8.54776446e-02 1.89000000e-02]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
(64 states x 4 actions; output truncated)
# Evaluate our Agent
mean_reward, std_reward = evaluate_agent(slippery_env, max_steps, n_eval_episodes, SQtable_frozenlake, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
Mean_reward=0.37 +/- 0.48
With is_slippery=True the environment is stochastic, so even the learned greedy policy only reaches the goal in a fraction of the episodes, which is why the mean reward is much lower (and the variance much higher) than in the deterministic case.
Visualize the Slippery Agent:
slippery_env = gym.wrappers.RecordVideo(slippery_env, video_folder="./Slippery-FrozenLake-v1-QL",disable_logger=True,fps=30)
state, info = slippery_env.reset()
for step in range(max_steps):
    action = greedy_policy(SQtable_frozenlake, state)
    state, reward, terminated, truncated, info = slippery_env.step(action)
    if terminated:
        break
slippery_env.close()

