RL [1]: Basic Concepts
Series Contents
Table of Contents
- Series Contents
- Preface
- Fundamental concepts in Reinforcement Learning
- Markov decision process (MDP)
- Summary
Preface
This series records my study notes for Professor Shiyu Zhao's Bilibili course "Mathematical Foundations of Reinforcement Learning" (强化学习的数学原理). For the course itself, see:
Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Fundamental concepts in Reinforcement Learning
- State: The status of the agent with respect to the environment.
- State space: the set of all states, $S = \{ s_i \}_{i=1}^N$.
- Action: a move or operation that the agent can take at a given state.
- Action space of a state: the set of all possible actions at a state, $A(s_i) = \{ a_i \}_{i=1}^N$.
- State transition: When taking an action, the agent may move from one state to another. Such a process is called state transition. State transition defines the interaction with the environment.
- Policy $\pi$: a policy tells the agent which action to take at each state.
- Reward: a real number we get after taking an action.
- A positive reward represents encouragement to take such actions.
- A negative reward represents punishment to take such actions.
- Reward can be interpreted as a human-machine interface, through which we can guide the agent to behave as we expect.
- The reward depends on the state and action, but not the next state.
- Trajectory: A trajectory is a state-action-reward chain.
- Return: the return of a trajectory is the sum of all the rewards collected along the trajectory (computed concretely in the grid-world sketch after this list).
- Discount rate: the discount rate is a scalar factor $\gamma \in [0,1)$ that determines the present value of future rewards. It specifies how much importance the agent assigns to rewards received in the future compared with immediate rewards.
- A smaller value of $\gamma$ makes the agent more short-sighted, emphasizing near-term rewards.
- A value closer to 1 encourages long-term planning by valuing distant rewards nearly as much as immediate ones.
- Discounted return $G_t$: the discounted return is the cumulative reward the agent aims to maximize, defined as the weighted sum of future rewards starting from time step $t$:
- $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
- It captures both immediate and future rewards while incorporating the discount rate $\gamma$, thereby balancing short-term and long-term gains.
- Episode: When interacting with the environment following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).
- An episode is usually assumed to be a finite trajectory. Tasks with episodes are called episodic tasks.
- Some tasks may have no terminal states, meaning the interaction with the environment will never end. Such tasks are called continuing tasks.
- In fact, we can treat episodic and continuing tasks in a unified mathematical way by converting episodic tasks to continuing tasks.
- Option 1: Treat the target state as a special absorbing state. Once the agent reaches an absorbing state, it will never leave, and all subsequent rewards are $r = 0$.
- Option 2: Treat the target state as a normal state with a policy. The agent can still leave the target state and gains $r = +1$ each time it enters the target state.
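To make these definitions concrete, here is a minimal Python sketch of a toy grid world. The 3x3 layout, the reward values, and the hand-written policy are my own illustrative assumptions (not taken from the course); the sketch only demonstrates the notions of state, action, state transition, reward, policy, trajectory, and discounted return.

```python
# A minimal 3x3 grid-world sketch (illustrative assumptions only: the layout,
# reward values, and policy below are not taken from the course material).

# State space: cells of a 3x3 grid, indexed 0..8; state 8 is the target.
N_ROWS, N_COLS = 3, 3
TARGET = 8

# Action space: the same five actions are available at every state.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "stay": (0, 0)}

def step(s, a):
    """State transition and reward. Deterministic for simplicity:
    bumping into the boundary keeps the agent in place (reward -1),
    entering the target gives reward +1, every other move gives 0."""
    row, col = divmod(s, N_COLS)
    dr, dc = MOVES[a]
    nr, nc = row + dr, col + dc
    if not (0 <= nr < N_ROWS and 0 <= nc < N_COLS):
        return s, -1.0
    s_next = nr * N_COLS + nc
    return s_next, (1.0 if s_next == TARGET else 0.0)

def policy(s):
    """A hand-written deterministic policy: go right until the last column, then down."""
    _, col = divmod(s, N_COLS)
    return "right" if col < N_COLS - 1 else "down"

def rollout(s0, horizon=20):
    """Generate a trajectory (a state-action-reward chain) by following the policy.
    Stopping at the target makes the trajectory an episode."""
    traj, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        s_next, r = step(s, a)
        traj.append((s, a, r))
        s = s_next
        if s == TARGET:
            break
    return traj

def discounted_return(traj, gamma=0.9):
    """G_0 = R_1 + gamma*R_2 + gamma^2*R_3 + ..."""
    return sum(gamma ** k * r for k, (_, _, r) in enumerate(traj))

traj = rollout(s0=0)
print(traj)                     # [(0, 'right', 0.0), (1, 'right', 0.0), (2, 'down', 0.0), (5, 'down', 1.0)]
print(discounted_return(traj))  # 0.9**3 * 1 ≈ 0.729
```

If the target were instead treated as a normal state that keeps yielding $r = +1$ on every entry (Option 2 above), the tail of the return would be $1 + \gamma + \gamma^2 + \cdots = \frac{1}{1-\gamma}$, which stays finite because $\gamma \in [0,1)$; this is what allows episodic and continuing tasks to be handled in the same framework.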
Markov decision process (MDP)
- Sets:
- State: the set of states $\mathcal{S}$
- Action: the set of actions $\mathcal{A}(s)$ associated with state $s \in \mathcal{S}$
- Reward: the set of rewards $\mathcal{R}(s, a)$
- Probability distribution:
- State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s' | s, a)$
- Reward probability: at state $s$, taking action $a$, the probability of receiving reward $r$ is $p(r | s, a)$
- Policy: at state $s$, the probability of choosing action $a$ is $\pi(a | s)$
- Markov property: the memoryless property, i.e., the next state and reward depend only on the current state and action, not on the earlier history (see the sampling sketch after this list):
- $p(s_{t+1} | a_t, s_t, \ldots, a_0, s_0) = p(s_{t+1} | a_t, s_t)$
- $p(r_{t+1} | a_t, s_t, \ldots, a_0, s_0) = p(r_{t+1} | a_t, s_t)$
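As a hedged illustration of these quantities, the sketch below writes out $p(s' | s, a)$, $p(r | s, a)$, and $\pi(a | s)$ for a made-up two-state MDP as explicit probability tables and then rolls the process forward. The specific numbers are arbitrary; the point is that each sampled next state and reward conditions only on the current pair $(s_t, a_t)$, which is precisely the Markov property.

```python
# A tiny MDP written out as explicit probability tables (the numbers are made up
# for illustration; only the structure p(s'|s,a), p(r|s,a), pi(a|s) matters).
import random

# State transition probabilities p(s' | s, a)
p_s = {
    ("s1", "a1"): {"s1": 0.8, "s2": 0.2},
    ("s1", "a2"): {"s1": 0.1, "s2": 0.9},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s1": 0.0, "s2": 1.0},
}

# Reward probabilities p(r | s, a) -- they depend on (s, a), not on the next state
p_r = {
    ("s1", "a1"): {0.0: 1.0},
    ("s1", "a2"): {1.0: 0.7, -1.0: 0.3},
    ("s2", "a1"): {0.0: 0.5, 1.0: 0.5},
    ("s2", "a2"): {1.0: 1.0},
}

# Stochastic policy pi(a | s)
pi = {
    "s1": {"a1": 0.6, "a2": 0.4},
    "s2": {"a1": 0.5, "a2": 0.5},
}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def simulate(s0, steps=5):
    """Roll the MDP forward. The next state and reward are sampled from
    distributions that condition only on (s_t, a_t) -- the Markov property."""
    s = s0
    for t in range(steps):
        a = sample(pi[s])
        r = sample(p_r[(s, a)])
        s_next = sample(p_s[(s, a)])
        print(f"t={t}: s={s}, a={a}, r={r}, s'={s_next}")
        s = s_next

simulate("s1")
```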
Summary
This first lecture systematically introduces the common concepts in RL and explains them in plain, accessible terms. It then uses the Markov decision process to restate these concepts in precise mathematical language, laying the groundwork for the lectures that follow.