
Generative Models | Deriving the Diffusion Model Loss Function

Continued from: Generative Models | Diffusion Model Formula Derivation. After additions, the original article grew too long and the site required splitting it.

Loss Function

Since a diffusion model generates an image through a step-by-step sequence, the probability of generating a real sample $x_0$ can be written as the integral of the joint probability over all intermediate transition sequences:

$$p_\theta(x_0) = \int p_\theta(x_0, x_1, \dots, x_T)\, dx_1 \dots dx_T$$

The training objective is to make the model's sample distribution $p_\theta(x)$ match the training-set distribution $q(x_0)$. The overall loss can therefore be written as the expected negative log-likelihood:

$$
\begin{aligned}
L & = \mathbb{E}_{q(x_0)} \left[ - \log p_\theta(x_0) \right] \\
& = - \int q(x_0) \log p_\theta(x_0)\, dx_0 \\
& = - \int q(x_0) \log \left( \int p_\theta(x_{0:T})\, dx_{1:T} \right) dx_0 \quad (\text{intractable: integrates over all noise trajectories}) \\
& = - \int q(x_0) \log \left( \int q(x_{1:T} \mid x_0) \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\, dx_{1:T} \right) dx_0 \quad (\text{introduce the easy-to-sample forward distribution}) \\
& \le - \int q(x_0) \left( \int q(x_{1:T} \mid x_0) \log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\, dx_{1:T} \right) dx_0 \quad \Big( \textstyle\int q(x_{1:T} \mid x_0)\, dx_{1:T} = 1 \Big) \\
& = - \int q(x_0)\, q(x_{1:T} \mid x_0) \log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\, dx_{0:T} \\
& = - \int q(x_{0:T}) \log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\, dx_{0:T} \\
& = - \mathbb{E}_{q(x_{0:T})} \left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right] \\
& = \mathbb{E}_{q(x_{0:T})} \left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right] \\
& = \mathbb{E}_{q(x_{0:T})} \left[ \log q(x_{1:T} \mid x_0) - \log p_\theta(x_{0:T}) \right] \\
& = \mathbb{E}_{q(x_{0:T})} \left[ \log \prod_{t=1}^{T} q(x_t \mid x_{t-1}) - \log \Big( p_\theta(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \Big) \right] \\
& = \mathbb{E}_{q(x_{0:T})} \left[ \sum_{t=1}^{T} \log q(x_t \mid x_{t-1}) - \log p_\theta(x_T) - \sum_{t=1}^{T} \log p_\theta(x_{t-1} \mid x_t) \right] \\
&\qquad \Big( q(x_t \mid x_{t-1}) = q(x_t \mid x_{t-1}, x_0) = \frac{q(x_t, x_{t-1}, x_0)}{q(x_{t-1}, x_0)} = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)\, q(x_0)}{q(x_{t-1}, x_0)} = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} \Big) \\
& = \mathbb{E}_{q(x_{0:T})} \left[ \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} + \log \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} - \log p_\theta(x_T) \right] \\
& = \mathbb{E}_{q(x_{0:T})} \left[ \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} + \sum_{t=2}^{T} \log \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} + \log \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} - \log p_\theta(x_T) \right] \\
& = \mathbb{E}_{q(x_{0:T})} \left[ \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} + \log \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)} + \log \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} - \log p_\theta(x_T) \right] \\
& = \mathbb{E}_{q(x_{0:T})} \left[ \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} + \log \frac{q(x_T \mid x_0)}{p_\theta(x_T)\, p_\theta(x_0 \mid x_1)} \right] \\
& = \mathbb{E}_{q(x_{0:T})} \left[ \log \frac{q(x_T \mid x_0)}{p_\theta(x_T)} + \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} - \log p_\theta(x_0 \mid x_1) \right] \\
& = \mathbb{E}_{q(x_{0:T})} \Big[ \log \frac{q(x_T \mid x_0)}{p_\theta(x_T)} \Big] + \sum_{t=2}^{T} \mathbb{E}_{q(x_{0:T})} \Big[ \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \Big] + \mathbb{E}_{q(x_{0:T})} \Big[ - \log p_\theta(x_0 \mid x_1) \Big] \\
& = \mathbb{E}_{q(x_0)} \mathbb{E}_{q(x_T \mid x_0)} \Big[ \log \frac{q(x_T \mid x_0)}{p_\theta(x_T)} \Big] + \sum_{t=2}^{T} \mathbb{E}_{q(x_t, x_0)} \mathbb{E}_{q(x_{t-1} \mid x_t, x_0)} \Big[ \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \Big] + \mathbb{E}_{q(x_0, x_1)} \Big[ - \log p_\theta(x_0 \mid x_1) \Big] \\
& = \mathbb{E}_{q(x_0)} \Big[ D_\text{KL}(q(x_T \mid x_0) \parallel p_\theta(x_T)) \Big] + \sum_{t=2}^{T} \mathbb{E}_{q(x_t, x_0)} \Big[ D_\text{KL}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t)) \Big] + \mathbb{E}_{q(x_0, x_1)} \Big[ - \log p_\theta(x_0 \mid x_1) \Big]
\end{aligned}
$$

Note that Jensen's inequality is used here: for a concave function $f(x)$ (such as the $\log$ function above),

$$f\Big(\sum_{i=1}^n \lambda_i x_i\Big) \ge \sum_{i=1}^n \lambda_i f(x_i), \quad \sum_{i=1}^n \lambda_i = 1,\ \lambda_i \ge 0$$

That is, the function value of a convex combination of the inputs is at least the convex combination of the function values.
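A quick numeric illustration of the inequality for $f = \log$ (the specific sample values are arbitrary):

```python
import math
import random

random.seed(0)

# Concave f = log; convex weights lambda_i summing to 1.
xs = [random.uniform(0.5, 5.0) for _ in range(4)]
raw = [random.random() for _ in range(4)]
lam = [r / sum(raw) for r in raw]

lhs = math.log(sum(l * x for l, x in zip(lam, xs)))  # f(sum lambda_i x_i)
rhs = sum(l * math.log(x) for l, x in zip(lam, xs))  # sum lambda_i f(x_i)
assert lhs >= rhs  # Jensen: log of the mean >= mean of the logs
```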

Denote the first term of the loss as $L_T$: since $q$ has no learnable parameters and $x_T$ is (essentially) fixed Gaussian noise, this term is a constant and can be ignored.
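This can be confirmed numerically: under a typical linear schedule (an assumed example, not specified in this post), $q(x_T \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_T}\, x_0, (1 - \bar{\alpha}_T) I)$ is already almost exactly $\mathcal{N}(0, I)$, so the per-dimension KL is negligible:

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alpha_bar = np.cumprod(1.0 - beta)

# q(x_T | x_0) = N(sqrt(abar_T) x0, (1 - abar_T) I) vs p(x_T) = N(0, I).
# Per-dimension closed-form Gaussian KL for a unit-scale x0 coordinate:
mu = np.sqrt(alpha_bar[-1])         # mean shrinks toward 0
var = 1.0 - alpha_bar[-1]           # variance approaches 1
kl = 0.5 * (var + mu ** 2 - 1.0 - np.log(var))
assert kl < 1e-4  # essentially zero: L_T is a negligible constant
```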

Denote the last term of the loss as $L_0$. Since $p_\theta(x_{t-1} \mid x_t) \approx q(x_{t-1} \mid x_t) = \mathcal{N}(\tilde{\mu}_\theta(x_t, t), \tilde{\beta}_t \mathbf{I})$, it can be approximated as:

$$
\begin{aligned}
p_\theta(x_0 \mid x_1) & = \mathcal{N}(\tilde{\mu}_\theta(x_1, 1), \tilde{\beta}_1 \mathbf{I}) \\
& = \frac{1}{(2 \pi \tilde{\beta}_1)^{d/2}} \exp\Big( -\frac{1}{2 \tilde{\beta}_1} \| x_0 - \tilde{\mu}_\theta(x_1, 1) \|^2 \Big) \\
-\log p_\theta(x_0 \mid x_1) & = \frac{d}{2} \log(2 \pi \tilde{\beta}_1) + \frac{1}{2 \tilde{\beta}_1} \| x_0 - \tilde{\mu}_\theta(x_1, 1) \|^2 \\
& \propto \| x_0 - \tilde{\mu}_\theta(x_1, 1) \|^2
\end{aligned}
$$
Many experiments show that even without emphasizing this term separately, sample quality remains high, especially when $T$ is large and the intermediate terms are well trained.

Denote the remaining terms as $L_{t-1}$. Each can be viewed as pulling two Gaussians together: $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(\tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t \mathbf{I})$ and $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(\tilde{\mu}_\theta(x_t, t), \tilde{\beta}_t \mathbf{I})$.

For two $d$-dimensional Gaussians, both with isotropic covariance matrices, the KL divergence follows from the general formula (https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Multivariate_normal_distributions):

$$D_\text{KL}(p \Vert q) = \frac{1}{2} \left( d \frac{\sigma_p^2}{\sigma_q^2} + \frac{\| \mu_p - \mu_q \|^2}{\sigma_q^2} - d - d \log \frac{\sigma_p^2}{\sigma_q^2} \right)$$
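As a sanity check, the closed form can be compared against a direct numerical integration of $\int p \log(p/q)\, dx$ in one dimension (the particular means and variances below are arbitrary):

```python
import numpy as np

def kl_isotropic(mu_p, mu_q, var_p, var_q, d):
    # Closed form for KL(N(mu_p, var_p I) || N(mu_q, var_q I)) in d dims.
    return 0.5 * (d * var_p / var_q
                  + np.sum((mu_p - mu_q) ** 2) / var_q
                  - d - d * np.log(var_p / var_q))

# 1-D check against a fine Riemann sum of p * log(p/q).
mu_p, mu_q, var_p, var_q = 0.3, -0.5, 0.8, 1.4
x = np.linspace(-12.0, 12.0, 200001)
p = np.exp(-(x - mu_p) ** 2 / (2 * var_p)) / np.sqrt(2 * np.pi * var_p)
q = np.exp(-(x - mu_q) ** 2 / (2 * var_q)) / np.sqrt(2 * np.pi * var_q)
numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])
closed = kl_isotropic(np.array([mu_p]), np.array([mu_q]), var_p, var_q, d=1)
assert abs(numeric - closed) < 1e-5
```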

So the concrete form is (taking $\tilde{\beta}_\theta = \tilde{\beta}_t$):

$$
\begin{aligned}
D_\text{KL}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t)) & = \frac{1}{2} \left( d \frac{\tilde{\beta}_t}{\tilde{\beta}_\theta} + \frac{\| \tilde{\mu}_t - \tilde{\mu}_\theta(x_t, t) \|^2}{\tilde{\beta}_\theta} - d - d \log \frac{\tilde{\beta}_t}{\tilde{\beta}_\theta} \right) \\
& = \frac{\| \tilde{\mu}_t - \tilde{\mu}_\theta(x_t, t) \|^2}{2 \tilde{\beta}_t} \\
& = \frac{1}{2 \tilde{\beta}_t} \left\| \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \bar{\epsilon}_t \Big) - \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \bar{\epsilon}_\theta(x_t, t) \Big) \right\|^2 \\
& = \frac{\beta_t^2}{2 \alpha_t (1 - \bar{\alpha}_t) \tilde{\beta}_t} \| \bar{\epsilon}_t - \bar{\epsilon}_\theta(x_t, t) \|^2 \\
& = \frac{\beta_t^2}{2 \alpha_t (1 - \bar{\alpha}_t) \tilde{\beta}_t} \| \bar{\epsilon}_t - \bar{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \bar{\epsilon}_t, t) \|^2 \\
& \propto \| \bar{\epsilon}_t - \bar{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \bar{\epsilon}_t, t) \|^2
\end{aligned}
$$
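Dropping the leading coefficient gives exactly the simplified DDPM training objective: sample a random step $t$, noise $x_0$ in one forward step, and regress the injected noise. A minimal numpy sketch; `eps_model` is a hypothetical stand-in for the trained network $\bar{\epsilon}_\theta$, and the linear schedule is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

def eps_model(x_t, t):
    # Hypothetical placeholder for the noise-prediction network eps_theta(x_t, t).
    return np.zeros_like(x_t)

def simplified_loss(x0):
    # Sample t uniformly, noise x0 in one forward step, regress the noise.
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

loss = simplified_loss(rng.standard_normal(64))
```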

Unlike the forward process, which reaches its target in a single step, sampling must start from random noise $x_T$ at time step $T$ and iteratively denoise, step by step, to obtain the target data $x_0$ at time step $0$.
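That iterative denoising loop can be sketched as follows; `eps_model` is again just a hypothetical placeholder for a trained noise predictor, with the reverse variance taken as $\sigma_t^2 = \beta_t$ (one of the two choices DDPM considers):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
beta = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

def eps_model(x_t, t):
    # Hypothetical placeholder for the trained noise predictor.
    return np.zeros_like(x_t)

# Reverse process: start from pure noise x_T and denoise step by step.
x = rng.standard_normal(16)                      # x_T ~ N(0, I)
for t in reversed(range(T)):
    z = rng.standard_normal(x.shape) if t > 0 else 0.0
    mean = (x - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_model(x, t)) \
           / np.sqrt(alpha[t])
    x = mean + np.sqrt(beta[t]) * z              # sigma_t^2 = beta_t variant
x0_hat = x                                       # estimate of x_0
```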

Notably, if $x_0$ is not substituted away in the earlier derivation of the posterior mean, the loss can instead be written as a regression toward the ground-truth image (related to another route for deriving the loss: https://spaces.ac.cn/archives/9164#去噪过程):

$$
\begin{aligned}
\tilde{\mu}_t & = \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 \\
\tilde{\mu}_\theta & = \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} f_\theta(x_t, t) \\
D_\text{KL}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t)) & = \frac{\| \tilde{\mu}_t - \tilde{\mu}_\theta(x_t, t) \|^2}{2 \tilde{\beta}_t} \\
& = \frac{1}{2 \tilde{\beta}_t} \left\| \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 - \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t - \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} f_\theta(x_t, t) \right\|^2 \\
& = \frac{\bar{\alpha}_{t-1} \beta_t^2}{2 \tilde{\beta}_t (1 - \bar{\alpha}_t)^2} \| x_0 - f_\theta(x_t, t) \|^2 \\
& \propto \| x_0 - f_\theta(x_t, t) \|^2
\end{aligned}
$$
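The two parameterizations are linked by $x_0 = (x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon) / \sqrt{\bar{\alpha}_t}$: substituting either form into $\tilde{\mu}_t$ yields the same posterior mean. A quick numeric check of that identity (the schedule and the step $t$ are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
beta = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

t = 60                              # any step with t >= 1
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Posterior mean written in terms of (x_t, x0) ...
mu_x0 = (np.sqrt(alpha[t]) * (1 - alpha_bar[t - 1]) * x_t
         + np.sqrt(alpha_bar[t - 1]) * beta[t] * x0) / (1 - alpha_bar[t])
# ... and rewritten in terms of (x_t, eps) after substituting x0.
mu_eps = (x_t - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha[t])

assert np.allclose(mu_x0, mu_eps)   # both forms give the same mean
```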

Of the two parameterizations, one estimates the noise and the other estimates the image. The DDPM experiments found that estimating the noise works better. The original paper also notes that dropping the leading coefficient, and sampling the time step $t$ at random during training, yields better generations.

References

  • Understanding Diffusion Models, step by step – article by ewrfcas on Zhihu: https://zhuanlan.zhihu.com/p/525106459
  • A hand-holding tutorial: implementing the DDPM generative diffusion model from scratch in PyTorch: https://zhuanlan.zhihu.com/p/617895786
  • A Leisurely Talk on Generative Diffusion Models (I): DDPM = demolish + rebuild: https://spaces.ac.cn/archives/9119
