
A Close Reading of "Attention Is All You Need" (13)

Original Text 26

4 Why Self-Attention

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_1,\cdots,x_n)$ to another sequence of equal length $(z_1,\cdots,z_n)$, with $x_i, z_i \in \mathbb{R}^d$, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.

Translation

4 Why Self-Attention

In this section, we compare self-attention layers with recurrent and convolutional layers along several dimensions. These layers are commonly used to map a variable-length sequence of symbol representations $(x_1,\cdots,x_n)$ to an equal-length sequence $(z_1,\cdots,z_n)$ (where $x_i, z_i \in \mathbb{R}^d$), for example as a hidden layer in a typical sequence-transduction encoder or decoder. To motivate our use of self-attention, we consider three desiderata.

Key Sentence Analysis

  1. In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_1,\cdots,x_n)$ to another sequence of equal length $(z_1,\cdots,z_n)$, with $x_i, z_i \in \mathbb{R}^d$, such as a hidden layer in a typical sequence transduction encoder or decoder.

【Analysis】

The backbone of the sentence reduces to "we compare A to B", where A is "various aspects of self-attention layers" and B is "the recurrent and convolutional layers". Note that "compare A to B" has two possible meanings: to liken A to B, or to compare A with B; here it clearly takes the second. "commonly used for …" is a passive-sense postmodifier of "the recurrent and convolutional layers", equivalent to the passive relative clause "which are commonly used for …". If we treat "one variable-length sequence of symbol representations $(x_1,\cdots,x_n)$" as C and "another sequence of equal length $(z_1,\cdots,z_n)$" as D, this part simplifies to "mapping C to D". The prepositional phrase "with $x_i, z_i \in \mathbb{R}^d$" is an adverbial stating a condition, and "such as …" introduces an example, with "a hidden layer" as the head noun and the prepositional phrase "in a typical sequence transduction encoder or decoder" as its postmodifier.

【Reference Translation】

In this section, we compare self-attention layers with recurrent and convolutional layers along several dimensions. These layers are commonly used to map a variable-length sequence of symbol representations $(x_1,\cdots,x_n)$ to an equal-length sequence $(z_1,\cdots,z_n)$ (where $x_i, z_i \in \mathbb{R}^d$), for example as a hidden layer in a typical sequence-transduction encoder or decoder.
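To make the mapping concrete, here is a minimal single-head self-attention sketch in NumPy, assuming toy sizes and random projection matrices (it is not the paper's multi-head implementation): it takes $n$ vectors $x_1,\cdots,x_n$ of dimension $d$ and returns an equal-length sequence $z_1,\cdots,z_n$ of the same dimension.

```python
# Minimal single-head self-attention sketch (illustrative only): (n, d) -> (n, d).
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (n, d) input sequence; w_q/w_k/w_v: (d, d) projection matrices (assumed)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # queries, keys, values: each (n, d)
    scores = q @ k.T / np.sqrt(x.shape[-1])           # (n, n) pairwise compatibilities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # outputs z_1..z_n, shape (n, d)

rng = np.random.default_rng(0)
n, d = 5, 8                                           # toy sequence length and dimension
x = rng.normal(size=(n, d))                           # x_1..x_n, each in R^d
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)         # (5, 8): equal-length output
```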

Original Text 27

One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

Translation

One desideratum is the total computational complexity per layer; another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

Key Sentence Analysis

  1. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

【Analysis】

The overall structure is: main clause + relative clause + adverbial of manner. "Another is the amount of computation" is a subject–link verb–predicative main clause, where the head of the predicative is "the amount" and the prepositional phrase "of computation" is its postmodifier. "that can be parallelized" is a relative clause modifying "computation"; "that" refers back to "computation" and serves as the subject of the clause. "as measured by the minimum number…" is an adverbial of manner formed by the conjunction "as" plus the past-participle phrase "measured by", explaining how the "parallelizable amount of computation" in the main clause is measured. The preposition "by" means "by means of"; the prepositional phrase "of sequential operations" postmodifies "the minimum number"; and the sentence-final "required" is a past participle postmodifying "sequential operations".

【Reference Translation】

Another desideratum is the amount of computation that can be parallelized, which is measured by the minimum number of sequential operations required.
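The two desiderata can be made concrete with the asymptotic figures from Table 1 of the paper. The toy numbers below are illustrative assumptions, and constant factors are ignored; $n$ is the sequence length, $d$ the representation dimension, and $k$ a convolution kernel width.

```python
# Rough per-layer cost comparison, following the asymptotics in Table 1 of the paper.
n, d, k = 50, 512, 3   # assumed toy values: short sentence, large model dimension

per_layer_cost = {
    "self-attention": n * n * d,      # O(n^2 * d) total work, O(1) sequential operations
    "recurrent":      n * d * d,      # O(n * d^2) total work, O(n) sequential operations
    "convolutional":  k * n * d * d,  # O(k * n * d^2) total work, O(1) sequential operations
}
for layer, ops in per_layer_cost.items():
    print(f"{layer:>15}: ~{ops:,} operations per layer")
# Because n < d here, n^2 * d < n * d^2: the self-attention layer does the least
# total work, and it also needs only a constant number of sequential steps.
```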

Original Text 28

The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.

Translation

The third desideratum is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence-transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths that forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence, we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.

Key Sentence Analysis

  1. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network.

【Analysis】

The backbone of the sentence is "One key factor is the length." The present-participle phrase "affecting the ability" postmodifies "factor", and the infinitive "to learn such dependencies" likewise postmodifies "the ability". The prepositional phrase "of the paths" postmodifies "the length". "the paths" is followed by a relative clause whose relative pronoun (that/which) has been omitted; in full it would read "that/which forward and backward signals have to traverse in the network". The core of that clause is "that/which signals have to traverse", where "that/which" refers to "the paths" and serves as the logical object of "traverse". "forward and backward" modify "signals"; the prepositional phrase "in the network" is an adverbial of place or scope modifying the verb "traverse"; and "have to do (sth.)" means "must / need to do something".

【Reference Translation】

One key factor affecting the ability to learn such dependencies is the length of the paths that forward and backward signals have to traverse in the network.

  2. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies.

【Analysis】

This is a "the + comparative …, the + comparative …" construction, meaning "the more …, the more …". The simplest example of the pattern is "The sooner, the better." The sentence can be reduced to "The shorter these paths are, the easier it is", and could also be rewritten as "If these paths are shorter, it will be easier." In the original, "these paths" carries a long string of modifiers, and the verb "are" is omitted. The two prepositional phrases "between any combination (of positions)" and "in the (input and output) sequences" are postmodifiers of "these paths" and "positions" respectively; within them, "of positions" postmodifies "combination", and "input and output" jointly modify "sequences". In the second half, "it" is a formal (dummy) subject; the real subject is the infinitive phrase "to learn long-range dependencies".

【Reference Translation】

The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies.

  3. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.

【Analysis】

The backbone of this sentence is "we compare length"; everything else is an adverbial or a modifier. The sentence-initial "hence" is an adverb marking the logical connection to the preceding text ("therefore"); the adverb "also" modifies the verb "compare"; "the maximum" and "path" both modify the object noun "length". The prepositional phrase "between (any two input and output) positions" postmodifies "path length", where "any", "two", and "input and output" all modify "positions". The prepositional phrase "in networks" is another postmodifier of "positions", and the past-participle phrase "composed of the different layer types" postmodifies "networks"; it could equally be rewritten as the relative clause "which/that are composed of the different layer types", where "be composed of" means "to consist of". "different" and "layer" both modify "types".

【Reference Translation】

Hence, we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
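As a quick illustration of this third desideratum, the sketch below tabulates the maximum path lengths reported in Table 1 of the paper for each layer type; the kernel width $k$ and neighborhood size $r$ used as defaults are assumptions for the example.

```python
# Maximum path length between any two positions, per layer type (Table 1 asymptotics).
import math

def max_path_length(layer, n, k=3, r=8):
    """Asymptotic path length; k = kernel width, r = neighborhood size (assumed defaults)."""
    return {
        "self-attention": 1,                            # every position attends to every other
        "recurrent": n,                                 # the signal steps through n hidden states
        "conv (contiguous kernels)": math.ceil(n / k),  # needs a stack of ~n/k layers
        "conv (dilated)": math.ceil(math.log(n, k)),
        "self-attention (restricted)": math.ceil(n / r),
    }[layer]

for layer in ("self-attention", "recurrent", "conv (contiguous kernels)",
              "conv (dilated)", "self-attention (restricted)"):
    print(f"{layer:>28}: {max_path_length(layer, n=1024)}")
```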

Original Text 29

As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O(n)$ sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [38] and byte-pair [31] representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence centered around the respective output position. This would increase the maximum path length to $O(n/r)$. We plan to investigate this approach further in future work.

Translation

As shown in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O(n)$ sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$, which is most often the case for the sentence representations used by state-of-the-art machine-translation models, such as word-piece [38] and byte-pair [31] representations. To improve computational performance on tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence, centered around the respective output position. This would increase the maximum path length to $O(n/r)$. We plan to investigate this approach further in future work.

Key Sentence Analysis

  1. As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O(n)$ sequential operations.

【Analysis】

The sentence-initial "As noted in Table 1" is equivalent to an elliptical non-restrictive relative clause, i.e. "As is noted in Table 1"; here "as" means "just as" and refers to the entire clause that follows. Next come two clauses joined by "whereas". The backbone of the first clause is "a self-attention layer connects all positions"; the following prepositional phrase "with a constant number of sequentially executed operations" is an adverbial of manner modifying "connects", in which "a constant number of" and "sequentially executed" both modify "operations". "whereas" marks contrast ("while / however"). The second clause, "a recurrent layer requires $O(n)$ sequential operations", is a subject–verb–object clause, with "requires" as the verb and the subject and object on either side of it.

【Reference Translation】

As shown in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O(n)$ sequential operations.

  2. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece and byte-pair representations.

【Analysis】

The overall structure is: adverbial + main clause + temporal adverbial clause + non-restrictive relative clause. The sentence-initial prepositional phrase "In terms of computational complexity" is an adverbial marking the topic; "in terms of" is a fixed phrase meaning "as far as … is concerned". "self-attention layers are faster than recurrent layers" is the main clause, built around the comparative "…are faster than…"; "when the sequence length $n$ is smaller than the representation dimensionality $d$" is a temporal adverbial clause introduced by "when", also containing a comparative ("…is smaller than…"); and "which … byte-pair representations" is a non-restrictive relative clause introduced by "which", whose core is "which is the case". The adverbial "most often" indicates how frequently this situation occurs; "with sentence representations used by …" is a "with + noun (sentence representations) + past-participle phrase (used by …)" construction acting as an adverbial of circumstance or condition. Within it, "used by state-of-the-art models" postmodifies "sentence representations" (i.e., used by cutting-edge models); "in machine translations" is another postmodifier, of "models"; and "such as …" gives examples of the specific types of representations.

【Reference Translation】

In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$, which is most often the case for the sentence representations used by state-of-the-art machine-translation models, such as word-piece [38] and byte-pair [31] representations.

  3. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence centered around the respective output position.

【Analysis】

The backbone of the sentence is "self-attention could be restricted to considering only a neighborhood." The predicate structure is "could be restricted to …" ("could be limited to …"), and the gerund phrase "considering …" that follows is its object. The prepositional phrases "of size $r$" and "in the input sequence" are both postmodifiers of "neighborhood", and the past-participle phrase "centered around the respective output position" also postmodifies "neighborhood".

Now for the sentence-initial infinitive. It acts as an adverbial of purpose, its core being "To improve computational performance"; the prepositional phrase "for tasks" postmodifies "computational performance", and the present-participle phrase "involving very long sequences" postmodifies "tasks".

【Reference Translation】

To improve computational performance on tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence, centered around the respective output position.

Or:

For better computational performance on tasks with very long sequences, the self-attention mechanism could be limited to attending only to a neighborhood of size $r$ in the input sequence, centered on each output position.
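The restricted variant is left to future work in the paper, so the sketch below is only an assumed illustration of the idea: a window of size $r$ around each output position masks the attention scores, which also shows why the maximum path length grows to $O(n/r)$.

```python
# Sketch of restricted self-attention: position i may only attend to positions within
# a window of size r centered on i. The exact windowing is an illustrative assumption;
# projections are omitted for brevity.
import numpy as np

def local_attention_mask(n, r):
    """Boolean (n, n) mask: True where position i is allowed to attend to position j."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= r // 2

def restricted_self_attention(x, r):
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                      # content-based scores
    scores = np.where(local_attention_mask(n, r), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the allowed window
    return weights @ x

x = np.random.default_rng(1).normal(size=(16, 4))
print(restricted_self_attention(x, r=4).shape)         # (16, 4)
# Information now travels at most ~r positions per layer, so linking two positions
# that are n apart takes on the order of n / r layers: maximum path length O(n/r).
```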

Original Text 30

A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions. Doing so requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of $k$. Separable convolutions [6], however, decrease the complexity considerably, to $O(k \cdot n \cdot d + n \cdot d^2)$. Even with $k = n$, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.

Translation

A single convolutional layer with kernel width $k$ ($k < n$) cannot connect all pairs of input and output positions. Achieving that full coverage (i.e., connecting all pairs) requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ layers in the case of dilated convolutions [18], either of which increases the length of the longest path between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of $k$. Separable convolutions [6], however, reduce the complexity considerably, to $O(k \cdot n \cdot d + n \cdot d^2)$. Even with $k = n$, though, the complexity of a separable convolution still equals that of the combination of a self-attention layer and a point-wise feed-forward layer, which is exactly the approach we take in our model.

Key Sentence Analysis

  1. Doing so requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ in the case of dilated convolutions, increasing the length of the longest paths between any two positions in the network.

【Analysis】

If we treat "a stack of $O(n/k)$ convolutional layers" as A, treat "$O(\log_k(n))$" as B, and temporarily set aside the two prepositional phrases "in the case of …" as well as the trailing present-participle phrase "increasing …", the sentence reduces to "Doing so requires A or B." Here "requires" is the verb, the gerund phrase "doing so" is the subject, and "A or B" are two coordinated objects. "Doing so" refers to "connecting all pairs of input and output positions". "a stack of $O(n/k)$ convolutional layers" literally denotes "a pile made up of $O(n/k)$ convolutional layers", and in context it can be rendered more freely as "stacking $O(n/k)$ convolutional layers". The prepositional phrase "in the case of" is an adverbial meaning "as for / in the case of". The present-participle phrase "increasing the length …" is an adverbial of result, expressing the consequence of the preceding operations; within it, the prepositional phrases "of the longest paths", "between any two positions", and "in the network" are all postmodifiers of the nouns that precede them.

【Reference Translation】

Achieving that full coverage (i.e., connecting all pairs) requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ layers in the case of dilated convolutions, either of which increases the length of the longest path between any two positions in the network.

  2. Even with $k = n$, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.

【Analysis】

The structure of the sentence is: adverbial + parenthesis + main content + appositive + relative clause. The sentence-initial "Even with $k = n$" is a conditional adverbial; the "however" between the two commas is a parenthesis; next comes the main content of the sentence, followed by the appositive "the approach" and a relative clause.

The backbone is "the complexity is equal to the combination." The prepositional phrase "of a separable convolution" postmodifies "the complexity"; the other prepositional phrase, "of a self-attention layer and a point-wise feed-forward layer", can be simplified to "of A and B" and postmodifies "the combination".

"the approach" is an appositive to "the combination", elaborating on "the combination of a self-attention layer and a point-wise feed-forward layer". "we take in our model" is a relative clause whose relative pronoun that/which has been omitted; that/which would refer to "the approach" and serve as the logical object of the verb "take", which is why it can be dropped.

【Reference Translation】

Even with $k = n$, the complexity of a separable convolution still equals that of the combination of a self-attention layer and a point-wise feed-forward layer, which is exactly the approach we take in our model.
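The complexity claims in this paragraph can be checked with a few lines of arithmetic. The numbers below are assumed toy values, and constant factors are ignored throughout; the final assertion mirrors the paper's point that at $k = n$ a separable convolution costs about as much as a self-attention layer plus a position-wise feed-forward layer.

```python
# Worked cost comparison for the convolution discussion (constant factors ignored).
n, d, k = 50, 512, 3                       # assumed toy sizes

standard_conv  = k * n * d * d             # O(k * n * d^2): k-wide kernel over d channels
separable_conv = k * n * d + n * d * d     # O(k*n*d + n*d^2): depthwise part + pointwise part
print(f"standard: ~{standard_conv:,}  separable: ~{separable_conv:,}")

# With k = n, the separable-convolution cost matches self-attention plus a
# position-wise feed-forward layer (up to constant factors):
k = n
separable_at_k_eq_n = k * n * d + n * d * d    # n^2*d + n*d^2
attention_plus_ffn  = n * n * d + n * d * d    # O(n^2*d) attention + O(n*d^2) feed-forward
assert separable_at_k_eq_n == attention_plus_ffn
```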

Original Text 31

As side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

Translation

As a side benefit, self-attention could yield more interpretable models. We inspect the attention distributions from our models and present and discuss examples in the appendix. The results show that not only do individual attention heads clearly learn to perform different tasks, many also appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

Key Sentence Analysis

  1. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

【Analysis】

This sentence consists of two clauses. The first uses partial inversion, placing the auxiliary "do" before the subject "individual attention heads": when a negative expression (such as "neither", "hardly", or "not only") opens a sentence, partial inversion is usually required, i.e., the be-verb or auxiliary is moved in front of the subject. Because the head noun of the subject, "heads", is plural and the tense is the simple present, the auxiliary is "do" rather than "does" or "did". In the second clause, "many" stands for "many attention heads", and "appear to do sth." means "seem to …". The past-participle phrase "related to the syntactic and semantic structure of the sentences" postmodifies "behavior"; "related to …" means "connected with …", and the prepositional phrase "of the sentences" postmodifies "the syntactic and semantic structure".

【Reference Translation】

(The results show that) not only do individual attention heads clearly learn to perform different tasks, many also appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
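To show what "inspecting an attention distribution" can look like in practice, here is a toy sketch: it computes a single attention-weight matrix for a made-up token sequence with random projections (so, unlike the paper's appendix, the pattern carries no learned meaning) and prints where each position attends most.

```python
# Toy inspection of an attention distribution; weights come from random projections,
# so the pattern is meaningless -- the paper's appendix visualizes real trained heads.
import numpy as np

rng = np.random.default_rng(2)
tokens = ["the", "cat", "sat", "on", "the", "mat"]   # made-up example sentence
n, d = len(tokens), 8
x = rng.normal(size=(n, d))
q, k = x @ rng.normal(size=(d, d)), x @ rng.normal(size=(d, d))

scores = q @ k.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # (n, n) attention distribution

for i, tok in enumerate(tokens):
    j = int(weights[i].argmax())
    print(f"{tok:>5} attends most to {tokens[j]!r} (weight {weights[i, j]:.2f})")
```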

