
RL【1】:Basic Concepts

Series Table of Contents


Table of Contents

  • Series Table of Contents
  • Preface
  • Fundamental concepts in Reinforcement Learning
  • Markov decision process (MDP)
  • Summary


Preface

This series of articles records my study notes for Prof. Shiyu Zhao's (赵世钰) Bilibili course Mathematical Foundations of Reinforcement Learning (【强化学习的数学原理】). For the course itself, please refer to:
Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning


Fundamental concepts in Reinforcement Learning

  • State: The status of the agent with respect to the environment.
  • State space: the set of all states, $S = \{ s_i \}_{i=1}^{N}$
  • Action: For each state, an action is one of the possible moves or operations that the agent can take.
  • Action space of a state: the set of all possible actions of a state, $A(s_i) = \{ a_i \}_{i=1}^{N}$
  • State transition: When taking an action, the agent may move from one state to another. Such a process is called state transition. State transition defines the interaction with the environment.
  • Policy $\pi$: A policy tells the agent which actions to take at each state.
  • Reward: a real number we get after taking an action.
    • A positive reward encourages the agent to take such actions.
    • A negative reward punishes (discourages) such actions.
    • Reward can be interpreted as a human-machine interface through which we guide the agent to behave as we expect.
    • The reward depends on the state and action, but not the next state.
  • Trajectory: A trajectory is a state-action-reward chain.
  • Return: The return of a trajectory is the sum of all the rewards collected along that trajectory.
  • Discount rate: The discount rate is a scalar factor $\gamma \in [0, 1)$ that determines the present value of future rewards. It specifies how much importance the agent assigns to rewards received in the future compared to immediate rewards.
    • A smaller value of $\gamma$ makes the agent more short-sighted, emphasizing immediate rewards.
    • A value closer to 1 encourages long-term planning by valuing distant rewards nearly as much as immediate ones.
  • Discounted return $G_t$: The discounted return is the cumulative reward the agent aims to maximize, defined as the weighted sum of future rewards starting from time step $t$:
    • $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
    • It captures both immediate and future rewards while incorporating the discount rate $\gamma$, thereby balancing short-term and long-term gains (see the sketch after this list).
  • Episode: When interacting with the environment following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).
    • An episode is usually assumed to be a finite trajectory. Tasks with episodes are called episodic tasks.
    • Some tasks may have no terminal states, meaning the interaction with the environment will never end. Such tasks are called continuing tasks.
    • In fact, we can treat episodic and continuing tasks in a unified mathematical way by converting episodic tasks to continuing tasks.
      • Option 1: Treat the target state as a special absorbing state. Once the agent reaches an absorbing state, it never leaves, and all subsequent rewards are $r = 0$.
      • Option 2: Treat the target state as a normal state with a policy. The agent can still leave the target state and gains a reward of $r = +1$ each time it enters the target state.
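
The following is a minimal, self-contained sketch of how the discounted return $G_t$ is accumulated along a trajectory. The reward sequence, the discount rates, and the helper `discounted_return` are hypothetical and only illustrate the formula above; they are not taken from the course materials.

```python
# Minimal sketch: computing the discounted return G_t of a finite trajectory.
# The reward values below are made up for illustration only.

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    for a finite list of rewards collected along one trajectory."""
    g = 0.0
    # Accumulate backwards: G_k = R_{k+1} + gamma * G_{k+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A hypothetical reward chain: 0 for ordinary moves, -1 for entering a
# forbidden cell, +1 for finally reaching the target state.
rewards = [0.0, -1.0, 0.0, 1.0]

print(discounted_return(rewards, gamma=0.9))  # 0 + 0.9*(-1) + 0.9**2*0 + 0.9**3*1 ≈ -0.171
print(discounted_return(rewards, gamma=0.0))  # short-sighted: only the first reward counts
```

With $\gamma = 0$ only the immediate reward matters, while values of $\gamma$ close to 1 weight distant rewards almost as heavily as immediate ones, matching the short-sighted versus far-sighted behavior described above.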

Markov decision process (MDP)

  • Sets:
    • State: the set of states $\mathcal{S}$
    • Action: the set of actions $\mathcal{A}(s)$ associated with state $s \in \mathcal{S}$
    • Reward: the set of rewards $\mathcal{R}(s, a)$
  • Probability distribution:
    • State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s' | s, a)$
    • Reward probability: at state $s$, taking action $a$, the probability of receiving reward $r$ is $p(r | s, a)$
  • Policy: at state $s$, the probability of choosing action $a$ is $\pi(a | s)$
  • Markov property: memoryless; the next state and reward depend only on the current state and action (all of these ingredients are combined in the sampling sketch below)
    • $p(s_{t+1} | a_t, s_t, \ldots, a_0, s_0) = p(s_{t+1} | a_t, s_t)$
    • $p(r_{t+1} | a_t, s_t, \ldots, a_0, s_0) = p(r_{t+1} | a_t, s_t)$
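
To make this notation concrete, below is a minimal sketch of a hypothetical two-state MDP (not an example from the course). The dictionaries `p_next`, `p_reward`, and `policy` stand in for $p(s' | s, a)$, $p(r | s, a)$, and $\pi(a | s)$, and the `rollout` loop samples a state-action-reward chain in which each step conditions only on the current state and action, which is exactly the Markov property.

```python
# Minimal sketch of the MDP ingredients above, with made-up numbers.
import random

states = ["s0", "s1"]                                        # the state set S
actions = {"s0": ["stay", "move"], "s1": ["stay", "move"]}   # A(s) for each state

# State transition probability p(s' | s, a)
p_next = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# Reward probability p(r | s, a), deterministic in this toy example
p_reward = {
    ("s0", "stay"): {0.0: 1.0},
    ("s0", "move"): {1.0: 1.0},
    ("s1", "stay"): {1.0: 1.0},
    ("s1", "move"): {0.0: 1.0},
}

# Stochastic policy pi(a | s)
policy = {
    "s0": {"stay": 0.1, "move": 0.9},
    "s1": {"stay": 0.9, "move": 0.1},
}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def rollout(s, steps=5):
    """Sample a state-action-reward chain. Note the Markov property:
    each step depends only on the current (s, a), not on the history."""
    trajectory = []
    for _ in range(steps):
        a = sample(policy[s])            # a  ~ pi(. | s)
        r = sample(p_reward[(s, a)])     # r  ~ p(. | s, a)
        s_next = sample(p_next[(s, a)])  # s' ~ p(. | s, a)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

print(rollout("s0"))
```

A sampled trajectory from `rollout("s0")` might look like `[('s0', 'move', 1.0), ('s1', 'stay', 1.0), ...]`; its discounted return can then be computed exactly as in the earlier sketch.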

Summary

This first lecture systematically introduces the common concepts in RL with accessible explanations, and then uses the Markov decision process to restate each concept in formal mathematical language, laying the groundwork for the lectures that follow.
