Theory and Implementation of Attention Mechanisms

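1. History of Attention

1.1 Timeline

2014: Attention for Seq2Seq (Bahdanau et al.; arXiv preprint, published at ICLR 2015)
2015: Global/Local Attention (Luong et al.)
2016: Self-Attention (Cheng et al.; Parikh et al.)
2017: "Attention Is All You Need" (Vaswani et al.)

1.2 The Essence of Attention

Core idea: weight information according to its relevance.

This removes the fixed-length bottleneck and models long-range dependencies directly.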

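2. Seq2Seq Attention

2.1 Background: Encoder-Decoder

The standard architecture for machine translation as of 2014:

Encoder: input sequence → fixed-length vector
Decoder: fixed-length vector → output sequence

Problem: the fixed-length vector is a bottleneck, and information is lost for long inputs.

2.2 Bahdanau Attention

At each decoding step, the decoder dynamically attends over all encoder outputs instead of relying on a single fixed vector.

Alignment score: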

$$e_{ij} = a(s_{i-1}, h_j)$$

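$s_{i-1}$: previous decoder state; $h_j$: the $j$-th encoder output

$a$: alignment model (a small neural network trained jointly with the rest of the model)

Attention weights:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$$

Context vector: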

$$c_i = \sum_j \alpha_{ij} h_j$$
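The three equations above can be sketched in NumPy as follows. This is a minimal illustration, assuming the additive form $a(s, h) = v^T \tanh(W_s s + W_h h)$ for the alignment model; the weights are random, and the names `additive_attention`, `W_s`, `W_h`, and `v` are chosen here for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def additive_attention(s_prev, H, W_s, W_h, v):
    """s_prev: (d_s,) previous decoder state; H: (T, d_h) encoder outputs."""
    # e_j = v^T tanh(W_s s_prev + W_h h_j), computed for all j at once.
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v   # (T,)
    alpha = softmax(scores)                        # attention weights, sum to 1
    context = alpha @ H                            # c = sum_j alpha_j h_j, (d_h,)
    return context, alpha

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
T, d_s, d_h, d_a = 5, 4, 6, 8
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
W_s = rng.normal(size=(d_s, d_a))
W_h = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)

context, alpha = additive_attention(s_prev, H, W_s, W_h, v)
print(alpha.shape, context.shape)
```

The context vector has the same dimensionality as a single encoder output, whatever the source length T.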

Key paper:

Bahdanau et al. (2015) "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR

2.3 Luong Attention

Simplified score functions, computed from the current decoder state $s_i$ rather than $s_{i-1}$:

Dot: $e_{ij} = s_i^T h_j$
General: $e_{ij} = s_i^T W_a h_j$
Concat: $e_{ij} = v_a^T \tanh(W_a[s_i; h_j])$
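The three score functions can be written side by side as below. This is a sketch under assumed shapes: `s` is one decoder state of size d, `H` stacks T encoder outputs of the same size d, and the function names and weight shapes are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def score_dot(s, H):
    # e_j = s^T h_j; decoder and encoder states must have equal size.
    return H @ s

def score_general(s, H, W_a):
    # e_j = s^T W_a h_j
    return H @ (W_a.T @ s)

def score_concat(s, H, W_a, v_a):
    # e_j = v_a^T tanh(W_a [s; h_j]); s is repeated for every encoder position.
    S = np.broadcast_to(s, (H.shape[0], s.shape[0]))
    return np.tanh(np.concatenate([S, H], axis=1) @ W_a.T) @ v_a

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d, T, d_a = 4, 3, 5
s = rng.normal(size=d)
H = rng.normal(size=(T, d))
W_general = rng.normal(size=(d, d))
W_concat = rng.normal(size=(d_a, 2 * d))
v_a = rng.normal(size=d_a)

e_dot = score_dot(s, H)
e_general = score_general(s, H, W_general)
e_concat = score_concat(s, H, W_concat, v_a)
```

With `W_a` set to the identity, the general score reduces to the dot score, which is one way to see that Dot is the parameter-free special case.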

Key paper:

Luong et al. (2015) "Effective Approaches to Attention-based Neural Machine Translation", EMNLP

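3. Self-Attention

3.1 Definition

Attention between positions within a single sequence: the queries and the keys/values come from the same sequence, rather than from a decoder and an encoder respectively.

For an input $X = (x_1, ..., x_n)$, every position attends to all positions.

3.2 Scaled Dot-Product Attention

The core operation of the Transformer (2017).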

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

$Q \in \mathbb{R}^{n \times d_k}$: query matrix
$K \in \mathbb{R}^{n \times d_k}$: key matrix
$V \in \mathbb{R}^{n \times d_v}$: value matrix
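A direct NumPy transcription of the formula, as a minimal single-head sketch without batching or masking. The function name `scaled_dot_product_attention` and the sizes below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (n, d_k), V: (n, d_v) -> output (n, d_v), weights (n, n)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every query to every key
    weights = softmax(scores, axis=-1)    # each row is a distribution over positions
    return weights @ V, weights

# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 5
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)
```

Each output row is a convex combination of the rows of V, since every row of `weights` is non-negative and sums to 1.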

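3.3 Why Scaling Is Needed

When $d_k$ is large, the entries of $QK^T$ have large variance.

The softmax then saturates, and its gradients become very small.

Dividing by $\sqrt{d_k}$ normalizes the variance back to 1.

Intuition:

$q \cdot k = \sum_i q_i k_i$. If the components of $q$ and $k$ are independent with mean 0 and variance 1, each term has variance 1 and the sum has variance $d_k$.

Dividing by $\sqrt{d_k}$ divides the variance by $d_k$, giving variance 1.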
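The variance argument can be checked empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than a property of trained models; the sample count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_samples = 20_000

# Components are i.i.d. with mean 0 and variance 1 (assumption of this sketch).
q = rng.normal(size=(num_samples, d_k))
k = rng.normal(size=(num_samples, d_k))

dots = np.sum(q * k, axis=1)                  # one dot product per sample
var_raw = dots.var()                          # expected to be close to d_k
var_scaled = (dots / np.sqrt(d_k)).var()      # expected to be close to 1

print(f"unscaled variance: {var_raw:.2f}, scaled variance: {var_scaled:.3f}")
```

The unscaled variance grows linearly with `d_k`, so without the scaling factor the softmax inputs spread further apart as the head dimension increases.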

3.4 Computational Cost

$QK^T$: $O(n^2 d_k)$
Softmax: $O(n^2)$
Attention × V: $O(n^2 d_v)$

Total: $O(n^2 d)$, quadratic in the sequence length $n$
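To make the quadratic growth concrete, the terms above can be added up for specific sizes. This is a rough operation count for a single head that ignores constants, memory traffic, and the projection layers; `attention_cost` is a hypothetical helper.

```python
def attention_cost(n, d_k, d_v):
    """Rough operation count for one attention head (constants ignored)."""
    qk = n * n * d_k       # scores QK^T
    softmax = n * n        # one normalization pass over the score matrix
    av = n * n * d_v       # weighted sum of the values
    return qk + softmax + av

base = attention_cost(1024, 64, 64)
doubled = attention_cost(2048, 64, 64)
ratio = doubled / base     # every term scales with n^2

print(base, doubled, ratio)
```

Because every term is proportional to $n^2$, doubling the sequence length quadruples the count, independent of $d_k$ and $d_v$.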

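4. Understanding Query, Key, and Value

4.1 The Information Retrieval Analogy

Query: the search query (what is being looked for)
Key: the index (what each item holds)
Value: the actual content (the information returned)

Process:

Compute the similarity between the Query and each Key (search)
Normalize the similarities with softmax (relevance weights)
Take the weighted sum of the Values (information aggregation)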

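4.2 Generating Q, K, V

Linear projections of the input $X$:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

$W^Q, W^K \in \mathbb{R}^{d \times d_k}$, $W^V \in \mathbb{R}^{d \times d_v}$

Why three different projections?

Query: a representation of "what this position is looking for"
Key: an index of "what this position can offer"
Value: "the information actually passed on"

Each role gets its own representation space.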

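4.3 Self-Attention and Cross-Attention

Self-Attention:

Q, K, and V are all generated from the same sequence.

Cross-Attention:

Q comes from the decoder; K and V come from the encoder.

Example: in translation, the target-language side attends to the source-language side.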
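The only difference between the two is where Q, K, and V are projected from, as the sketch below illustrates. The arrays `enc` and `dec` stand in for encoder and decoder hidden states and are random here; reusing one set of projection weights for both calls is a simplification for brevity, since real models typically learn separate weights for each attention layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d, d_k, d_v = 16, 8, 8
W_Q, W_K, W_V = (rng.normal(size=(d, m)) / np.sqrt(d) for m in (d_k, d_k, d_v))

enc = rng.normal(size=(7, d))   # source-side (encoder) representations
dec = rng.normal(size=(4, d))   # target-side (decoder) representations

# Self-attention: Q, K, V all come from the same sequence.
self_out = attention(enc @ W_Q, enc @ W_K, enc @ W_V)    # (7, d_v)

# Cross-attention: queries from the decoder, keys/values from the encoder.
cross_out = attention(dec @ W_Q, enc @ W_K, enc @ W_V)   # (4, d_v)

print(self_out.shape, cross_out.shape)
```

The output always has one row per query, so cross-attention returns as many rows as there are decoder positions even though it reads from a longer source sequence.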

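5. Multi-Head Attention

5.1 Motivation

A single attention function struggles to learn several different kinds of relations at once.

Multiple "attention heads" learn relations in different subspaces in parallel.

5.2 Formulation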

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Parameters:

$W_i^Q, W_i^K \in \mathbb{R}^{d \times d_k}$
$W_i^V \in \mathbb{R}^{d \times d_v}$
$W^O \in \mathbb{R}^{hd_v \times d}$

Typical setting: $h = 8$ and $d_k = d_v = d/h = 64$ (for $d = 512$)
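A compact sketch of the formulation for the self-attention case, where the Q, K, and V inputs are all the same matrix X. Stacking the per-head weights into arrays of shape (h, d, d_k) is a layout chosen for this example; the weights are random and the function name `multi_head_attention` is illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n, d); W_Q, W_K: (h, d, d_k); W_V: (h, d, d_v); W_O: (h*d_v, d)."""
    # Project the input once per head.
    Q = np.einsum("nd,hdk->hnk", X, W_Q)
    K = np.einsum("nd,hdk->hnk", X, W_K)
    V = np.einsum("nd,hdv->hnv", X, W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])    # (h, n, n)
    heads = softmax(scores, axis=-1) @ V                        # (h, n, d_v)
    # Concat(head_1, ..., head_h) along the feature axis, then mix with W^O.
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)   # (n, h*d_v)
    return concat @ W_O

# Sizes follow the typical setting above; weights are random for illustration.
rng = np.random.default_rng(0)
n, d, h = 10, 512, 8
d_k = d_v = d // h
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(h, d, d_k)) / np.sqrt(d)
W_K = rng.normal(size=(h, d, d_k)) / np.sqrt(d)
W_V = rng.normal(size=(h, d, d_v)) / np.sqrt(d)
W_O = rng.normal(size=(h * d_v, d)) / np.sqrt(h * d_v)

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)
```

All heads are evaluated in one batched matrix product, and the output has the same shape as the input, so the layer can be stacked.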

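5.3 What Each Head Learns

Through training, heads come to capture different patterns:

Syntactic relations (subject-verb, modifier-head)
Semantic relations (synonyms, antonyms)
Positional relations (adjacent, long-range)
Specific patterns (matching quotation marks and brackets)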

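5.4 Computational Efficiency

One large attention vs. several small ones:

With $d_k = d_v = d/h$, the total cost is comparable to single-head attention over the full dimension $d$, while the heads can be computed in parallel and make the model more expressive.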
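The cost comparison can be checked with simple arithmetic. The sketch assumes the single-head baseline uses $d_k = d_v = d$ and counts only the Q/K/V projection parameters and the score-matrix multiply-accumulates; the sequence length `n` is an arbitrary example value.

```python
d, h, n = 512, 8, 1024
d_k = d // h   # per-head dimension, 64 in this setting

# Projection parameters for W^Q, W^K, W^V.
single_head_params = 3 * d * d          # one head with d_k = d_v = d
multi_head_params = 3 * h * d * d_k     # h heads with d_k = d_v = d / h

# Multiply-accumulates for the score matrices QK^T.
single_head_score_macs = n * n * d
multi_head_score_macs = h * n * n * d_k

print(single_head_params == multi_head_params)
print(single_head_score_macs == multi_head_score_macs)
```

Both counts match because $h \cdot d_k = d$; splitting into heads redistributes the same budget across subspaces rather than adding to it.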

6. Attention Variants

6.1 Causal (Masked) Attention

For autoregressive generation, a position must not attend to future positions.

$$\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

$$\text{Attention} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \text{Mask}\right)V$$

Used in GPT-style decoder-only models.
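The mask is simply added to the scores before the softmax. A minimal single-head sketch, with random inputs and an illustrative function name `causal_attention`:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Mask[i, j] = -inf for j > i (strictly upper triangle), 0 otherwise.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    weights = softmax(scores + mask, axis=-1)   # exp(-inf) = 0 removes future positions
    return weights @ V, weights

# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
n, d_k = 5, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

out, weights = causal_attention(Q, K, V)
print(np.round(weights, 2))
```

The printed weight matrix is lower triangular: the first position can only attend to itself, and position i spreads its weight over positions up to and including i.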

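6.2 Bidirectional Attention

Every position attends to all positions in both directions. Used in the BERT encoder.

Suited to non-generative tasks such as classification and NER.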

6.3 Local/Sliding Window Attention

Each position attends only within a fixed window.

Cost: $O(n \cdot w)$, where $w$ is the window size

Adopted by Mistral and Longformer.
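A sliding window can be expressed as an additive mask in the same way as a causal mask. The sketch below assumes a causal window (each position sees itself and the previous w - 1 positions); a bidirectional model would use a symmetric window instead. Building the dense n × n mask is for illustration only, since an efficient implementation would compute just the band to actually reach $O(n \cdot w)$. `sliding_window_mask` is a hypothetical helper.

```python
import numpy as np

def sliding_window_mask(n, w):
    """Additive mask: position i may attend to j with i - w < j <= i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    allowed = (j <= i) & (j > i - w)
    return np.where(allowed, 0.0, -np.inf)

mask = sliding_window_mask(8, 3)
allowed_per_row = (mask == 0).sum(axis=1)   # at most w entries per row

print(allowed_per_row)
```

Each row has at most w unmasked entries, so the number of score entries that matter grows as n · w rather than n².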

6.4 Sparse Attention

Each query attends to only a subset of the key-value pairs.

Fixed patterns: local + global
Learned patterns: determined by training

Adopted by BigBird and Longformer.

6.5 Linear Attention

Approximates the softmax to bring the cost down to $O(n)$ in the sequence length.

$$\text{Attention}(Q, K, V) = \phi(Q)(\phi(K)^T V)$$

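$\phi$: feature map. Computing $\phi(K)^T V$ first avoids forming the $n \times n$ matrix; the row-wise normalization term is omitted from the formula above. The approximation trades accuracy for speed.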
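The reordering of the matrix products can be verified numerically. The sketch assumes $\phi(x) = \mathrm{elu}(x) + 1$, a commonly used positive feature map, and includes a row-wise normalizer so the weights sum to 1; both are choices made for this example, and the source formula leaves $\phi$ unspecified.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise (always positive).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                  # (d_k, d_v): no n x n matrix is formed
    Z = Qf @ Kf.sum(axis=0)        # (n,) row-wise normalizer
    return (Qf @ KV) / Z[:, None]

def quadratic_reference(Q, K, V):
    # Same kernel attention computed the O(n^2) way, for comparison.
    A = phi(Q) @ phi(K).T          # (n, n)
    return (A / A.sum(axis=1, keepdims=True)) @ V

# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
n, d_k, d_v = 50, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
print(np.allclose(fast, slow))
```

The two functions agree because matrix multiplication is associative; what is approximate is the replacement of the softmax kernel by $\phi(q)^T \phi(k)$, not the reordering itself.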

6.6 GQA (Grouped Query Attention)

→ See "Evolution of Architectures" for details.

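7. Interpretation and Visualization

7.1 What Attention Weights Mean

$A = \text{softmax}(QK^T/\sqrt{d_k})$

$A_{ij}$: the amount of "attention" position $i$ pays to position $j$.

Caution: attention weights do not imply causation, so interpret them with care.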

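7.2 Visualization Methods

Attention heatmap: display the matrix as colors
BertViz: an interactive visualization tool
Per-head analysis: examine the role of each head

7.3 Observed Phenomena

Positional heads: attend strongly to adjacent tokens
Syntactic heads: capture syntactic relations
Rare word heads: attend to rare words
Delimiter heads: attend to punctuation and separators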

7.4 Attention Flow

Tracks how attention propagates across layers.

Attention Rollout: accumulates the per-layer attention matrices by multiplying them across layers.
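A minimal sketch of rollout, assuming one head-averaged attention matrix per layer and the common convention of mixing each matrix equally with the identity to account for residual connections. The per-layer matrices here are random stand-ins for real model attention, and `attention_rollout` is an illustrative name.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_rollout(attentions):
    """attentions: list of (n, n) row-stochastic matrices, lowest layer first."""
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for A in attentions:
        A_res = 0.5 * A + 0.5 * np.eye(n)                   # residual connection
        A_res = A_res / A_res.sum(axis=-1, keepdims=True)   # keep rows normalized
        rollout = A_res @ rollout                           # compose with lower layers
    return rollout

# Random stand-ins for head-averaged attention from a 4-layer model.
rng = np.random.default_rng(0)
n, num_layers = 6, 4
attentions = [softmax(rng.normal(size=(n, n))) for _ in range(num_layers)]

rollout = attention_rollout(attentions)
print(np.round(rollout.sum(axis=-1), 6))
```

Entry (i, j) of the result estimates how much of input token j reaches position i at the top layer; each row still sums to 1 because a product of row-stochastic matrices is row-stochastic.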

Key papers:

Clark et al. (2019) "What Does BERT Look At?", BlackboxNLP
Voita et al. (2019) "Analyzing Multi-Head Self-Attention", ACL
Abnar & Zuidema (2020) "Quantifying Attention Flow in Transformers", ACL

8. References

Foundational papers

Bahdanau et al. (2015) "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR
Luong et al. (2015) "Effective Approaches to Attention-based Neural Machine Translation", EMNLP
Vaswani et al. (2017) "Attention Is All You Need", NeurIPS

Analysis and interpretation

Clark et al. (2019) "What Does BERT Look At?", BlackboxNLP
Voita et al. (2019) "Analyzing Multi-Head Self-Attention", ACL
Jain & Wallace (2019) "Attention is not Explanation", NAACL

Efficiency

Tay et al. (2020) "Efficient Transformers: A Survey", arXiv
Dao et al. (2022) "FlashAttention", NeurIPS
