
RL【1】:Basic Concepts

Series Table of Contents


Table of Contents

  • Series Table of Contents
  • Preface
  • Fundamental concepts in Reinforcement Learning
  • Markov decision process (MDP)
  • Summary


Preface

This series of articles records my study notes for Prof. Shiyu Zhao's (赵世钰) Bilibili course Mathematical Foundations of Reinforcement Learning (【强化学习的数学原理】). For the course itself, please refer to:
Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning


Fundamental concepts in Reinforcement Learning

  • State: The status of the agent with respect to the environment.
  • State space: the set of all states, $S = \{ s_i \}_{i=1}^{N}$
  • Action: For each state, an action is one of the possible moves or operations that the agent can take.
  • Action space of a state: the set of all possible actions of a state, $A(s_i) = \{ a_i \}_{i=1}^{N}$
  • State transition: When taking an action, the agent may move from one state to another. Such a process is called state transition. State transition defines the interaction with the environment.
  • Policy $\pi$: A policy tells the agent which actions to take at each state.
  • Reward: a real number we get after taking an action.
    • A positive reward encourages the agent to take such actions.
    • A negative reward punishes (discourages) such actions.
    • Reward can be interpreted as a human-machine interface through which we guide the agent to behave as we expect.
    • The reward depends on the state and action, but not the next state.
  • Trajectory: A trajectory is a state-action-reward chain.
  • Return: The return of a trajectory is the sum of all the rewards collected along that trajectory.
  • Discount rate: The discount rate is a scalar factor $\gamma \in [0, 1)$ that determines the present value of future rewards. It specifies how much importance the agent assigns to rewards received in the future compared to immediate rewards.
    • A smaller value of $\gamma$ makes the agent more short-sighted, emphasizing immediate rewards.
    • A value closer to 1 encourages long-term planning by valuing distant rewards nearly as much as immediate ones.
  • Discounted return $G_t$: The discounted return is the cumulative reward the agent aims to maximize, defined as the weighted sum of future rewards starting from time step $t$:
    • $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
    • It captures both immediate and future rewards while incorporating the discount rate $\gamma$, thereby balancing short-term and long-term gains (see the sketch after this list).
  • Episode: When interacting with the environment following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).
    • An episode is usually assumed to be a finite trajectory. Tasks with episodes are called episodic tasks.
    • Some tasks may have no terminal states, meaning the interaction with the environment will never end. Such tasks are called continuing tasks.
    • In fact, we can treat episodic and continuing tasks in a unified mathematical way by converting episodic tasks to continuing tasks.
      • Option 1: Treat the target state as a special absorbing state. Once the agent reaches an absorbing state, it never leaves, and all subsequent rewards are $r = 0$.
      • Option 2: Treat the target state as a normal state with a policy. The agent can still leave the target state and gains a reward of $r = +1$ each time it enters the target state.
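
The following is a minimal, self-contained sketch of how the discounted return $G_t$ is accumulated along a trajectory. The reward sequence, the discount rates, and the helper `discounted_return` are hypothetical and only illustrate the formula above; they are not taken from the course materials.

```python
# Minimal sketch: computing the discounted return G_t of a finite trajectory.
# The reward values below are made up for illustration only.

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    for a finite list of rewards collected along one trajectory."""
    g = 0.0
    # Accumulate backwards: G_k = R_{k+1} + gamma * G_{k+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A hypothetical reward chain: 0 for ordinary moves, -1 for entering a
# forbidden cell, +1 for finally reaching the target state.
rewards = [0.0, -1.0, 0.0, 1.0]

print(discounted_return(rewards, gamma=0.9))  # 0 + 0.9*(-1) + 0.9**2*0 + 0.9**3*1 ≈ -0.171
print(discounted_return(rewards, gamma=0.0))  # short-sighted: only the first reward counts
```

With $\gamma = 0$ only the immediate reward matters, while values of $\gamma$ close to 1 weight distant rewards almost as heavily as immediate ones, matching the short-sighted versus far-sighted behavior described above.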

Markov decision process (MDP)

  • Sets:
    • State: the set of states $\mathcal{S}$
    • Action: the set of actions $\mathcal{A}(s)$ associated with state $s \in \mathcal{S}$
    • Reward: the set of rewards $\mathcal{R}(s, a)$
  • Probability distribution:
    • State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s' | s, a)$
    • Reward probability: at state $s$, taking action $a$, the probability of receiving reward $r$ is $p(r | s, a)$
  • Policy: at state $s$, the probability of choosing action $a$ is $\pi(a | s)$
  • Markov property: memoryless; the next state and reward depend only on the current state and action (all of these ingredients are combined in the sampling sketch below)
    • $p(s_{t+1} | a_t, s_t, \ldots, a_0, s_0) = p(s_{t+1} | a_t, s_t)$
    • $p(r_{t+1} | a_t, s_t, \ldots, a_0, s_0) = p(r_{t+1} | a_t, s_t)$
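
To make this notation concrete, below is a minimal sketch of a hypothetical two-state MDP (not an example from the course). The dictionaries `p_next`, `p_reward`, and `policy` stand in for $p(s' | s, a)$, $p(r | s, a)$, and $\pi(a | s)$, and the `rollout` loop samples a state-action-reward chain in which each step conditions only on the current state and action, which is exactly the Markov property.

```python
# Minimal sketch of the MDP ingredients above, with made-up numbers.
import random

states = ["s0", "s1"]                                        # the state set S
actions = {"s0": ["stay", "move"], "s1": ["stay", "move"]}   # A(s) for each state

# State transition probability p(s' | s, a)
p_next = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# Reward probability p(r | s, a), deterministic in this toy example
p_reward = {
    ("s0", "stay"): {0.0: 1.0},
    ("s0", "move"): {1.0: 1.0},
    ("s1", "stay"): {1.0: 1.0},
    ("s1", "move"): {0.0: 1.0},
}

# Stochastic policy pi(a | s)
policy = {
    "s0": {"stay": 0.1, "move": 0.9},
    "s1": {"stay": 0.9, "move": 0.1},
}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def rollout(s, steps=5):
    """Sample a state-action-reward chain. Note the Markov property:
    each step depends only on the current (s, a), not on the history."""
    trajectory = []
    for _ in range(steps):
        a = sample(policy[s])            # a  ~ pi(. | s)
        r = sample(p_reward[(s, a)])     # r  ~ p(. | s, a)
        s_next = sample(p_next[(s, a)])  # s' ~ p(. | s, a)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

print(rollout("s0"))
```

A sampled trajectory from `rollout("s0")` might look like `[('s0', 'move', 1.0), ('s1', 'stay', 1.0), ...]`; its discounted return can then be computed exactly as in the earlier sketch.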

Summary

This first lecture systematically introduces the common concepts in RL with accessible explanations, and then uses the Markov decision process to restate each concept in formal mathematical language, laying the groundwork for the lectures that follow.
