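The Transformer Revolution

1. Overview of the Transformer

1.1 Historical Significance

Proposed by Vaswani et al. (2017) in "Attention Is All You Need".

Key innovations:

Sequence processing without recurrence or convolution
Long-range dependencies learned through attention alone
Training that parallelizes across all sequence positions
State of the art on machine translation, later dominating NLP as a whole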

1.2 Basic Structure

Original Transformer: Encoder-Decoder structure

Encoder: 6 layers, each consisting of Self-Attention + FFN
Decoder: 6 layers, each consisting of Masked Self-Attention + Cross-Attention + FFN

Later developments:

Encoder-only: BERT family (classification, NER)
Decoder-only: GPT family (generation)
Encoder-Decoder: T5 family (Seq2Seq)

Key papers:

Vaswani et al. (2017) "Attention Is All You Need", NeurIPS

2. The Self-Attention Mechanism

2.1 Query, Key, Value

Three representations are derived from the input $X \in \mathbb{R}^{n \times d}$:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

$W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$

Intuition:

Query: "what am I looking for?"
Key: "what do I contain?"
Value: "the information I provide"

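2.2 Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Scaling by $\sqrt{d_k}$:

If the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance $d_k$. For large $d_k$ the entries of $QK^T$ therefore grow in magnitude and the softmax saturates, leaving very small gradients. Dividing by $\sqrt{d_k}$ keeps the scores at unit scale and stabilizes training.

Computational cost: $O(n^2 d)$ (quadratic in the sequence length $n$)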
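
As a rough illustration of the formula above, the following is a minimal sketch of scaled dot-product attention in plain Python. It uses nested lists instead of an array library to stay dependency-free, and the toy values of `Q`, `K`, and `V` are hypothetical, chosen only for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors.

    Q and K are n x d_k, V is n x d_v. Returns (output, weights).
    """
    scale = math.sqrt(len(K[0]))
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [[sum(a * b for a, b in zip(q, k)) / scale for k in K] for q in Q]
    # Each row of A is a probability distribution over key positions j.
    A = [softmax(row) for row in scores]
    # Output row i is the A[i]-weighted average of the value vectors.
    out = [[sum(A[i][j] * V[j][c] for j in range(len(V)))
            for c in range(len(V[0]))] for i in range(len(Q))]
    return out, A

# Toy input (hypothetical values): 3 positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, A = attention(Q, K, V)
```

The double loop over queries and keys is where the $O(n^2 d)$ cost comes from: every position is scored against every other position.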

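2.3 Interpreting Attention Weights

$A = \text{softmax}(QK^T / \sqrt{d_k})$ is an $n \times n$ weight matrix whose rows each sum to 1.

$A_{ij}$: how much position $i$ attends to position $j$.

Visualizing $A$ can help interpret the model's behavior.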

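2.4 Comparison with RNNs

Property	RNN	Self-Attention
Long-range dependencies	$O(n)$ steps required	$O(1)$ (direct connection)
Parallelization	Sequential	Fully parallel
Computation per layer	$O(n \cdot d^2)$	$O(n^2 \cdot d)$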

3. Multi-Head Attention

3.1 Motivation

A single attention function struggles to learn different kinds of relations at the same time.

Multiple "attention heads" learn relations in different subspaces.

3.2 Formulation

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

$W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$, $W^O \in \mathbb{R}^{hd_k \times d}$

Typically $h = 8$ and $d_k = d/h = 64$ (for $d = 512$)
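
The formulation above can be sketched as follows, assuming self-attention (so $Q = K = V = X$). The small sizes `n = 4`, `d = 8`, `h = 2` and the random weights are hypothetical and stand in for trained parameters; the helper names are illustrative only.

```python
import math
import random

def matmul(A, B):
    """Product of an (n x m) and an (m x p) matrix stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    scale = math.sqrt(len(K[0]))
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    A = [softmax([s / scale for s in row]) for row in scores]
    return matmul(A, V)

def rand_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V), each d x d_k; W_O: (h * d_k) x d."""
    outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Concatenate the per-head outputs along the feature axis.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n, d, h = 4, 8, 2
d_k = d // h
X = rand_matrix(n, d, rng)
heads = [tuple(rand_matrix(d, d_k, rng) for _ in range(3)) for _ in range(h)]
W_O = rand_matrix(h * d_k, d, rng)
Y = multi_head(X, heads, W_O)  # shape n x d, same as the input
```

Because $d_k = d/h$, the concatenated head outputs have width $h d_k = d$, so the total cost stays close to that of one full-width attention.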

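3.3 The Role of Each Head

Through training, different heads come to capture different relations:

Syntactic relations (subject-verb)
Semantic relations (synonyms)
Positional relations (neighboring tokens)
Specific patterns (matching brackets, etc.)

Key papers:

Clark et al. (2019) "What Does BERT Look At? An Analysis of BERT's Attention", BlackboxNLP
Voita et al. (2019) "Analyzing Multi-Head Self-Attention", ACL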

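4. Positional Encoding

4.1 Why It Is Needed

Self-Attention is permutation-equivariant: permuting the input tokens simply permutes the outputs in the same way, so the operation by itself cannot distinguish token order.

Order information must be injected explicitly.

4.2 Sinusoidal Encoding (Original)

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$$

Properties:

No learning required; defined for any position, so it is not tied to a maximum length
For a fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$, so relative positions can be expressed by a linear transformation
Each dimension is periodic, but combining different frequencies gives each position a distinct encoding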
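
The two formulas above translate almost directly into code. This is a minimal sketch that assumes an even model dimension `d`; the function name and the sizes in the usage line are illustrative.

```python
import math

def sinusoidal_pe(num_positions, d):
    """Return a num_positions x d table of sinusoidal encodings (d even)."""
    pe = [[0.0] * d for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(d // 2):
            angle = pos / (10000 ** (2 * i / d))
            pe[pos][2 * i] = math.sin(angle)      # even dimensions
            pe[pos][2 * i + 1] = math.cos(angle)  # odd dimensions
    return pe

pe = sinusoidal_pe(num_positions=50, d=16)
```

Each (sin, cos) pair is a point on the unit circle rotating at its own frequency, which is why a fixed offset corresponds to a fixed rotation of every pair.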

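4.3 Learned Positional Encoding

Adopted in BERT, GPT, and others. A table $PE \in \mathbb{R}^{L \times d}$ is learned.

Limited to a maximum sequence length $L$. Vaswani et al. (2017) reported nearly identical results for learned and sinusoidal encodings.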

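4.4 Relative Positional Encoding

Shaw et al. (2018):

Adds relative position information to the attention scores.

ALiBi (Press et al., 2022):

No position embeddings. A distance penalty is added to the attention scores.

$$\text{softmax}(QK^T - m \cdot |i-j|)$$

Here $m$ is a fixed, head-specific slope and $|i-j|$ is the distance between query position $i$ and key position $j$; the penalty is applied to entry $(i, j)$ of the score matrix.

Extrapolates well to contexts longer than those seen in training.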
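
To make the distance penalty concrete, here is a minimal sketch of the bias term $-m \cdot |i-j|$. The slope `m = 0.5` is an illustrative value, not one taken from the paper, and the raw scores are set to zero so that only the bias shapes the weights. In a decoder the causal mask would additionally hide positions $j > i$.

```python
import math

def alibi_bias(n, m):
    """n x n matrix whose (i, j) entry is -m * |i - j|."""
    return [[-m * abs(i - j) for j in range(n)] for i in range(n)]

def softmax(row):
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

n, m = 4, 0.5  # m is a hypothetical slope chosen for illustration
bias = alibi_bias(n, m)
# With all raw scores equal (zero), the weights depend on distance only.
weights = [softmax(row) for row in bias]
```

Nearby positions receive more weight than distant ones, and the penalty is defined for any distance, which is the intuition behind extrapolation to longer inputs.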

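RoPE (Su et al., 2021):

Positional encoding via rotation matrices.

$$q_m = R_m q, \quad k_n = R_n k$$

$R_m$ is the rotation for position $m$. Because rotations compose, $q_m^T k_n = q^T R_{n-m} k$, so the attention score depends only on the relative offset $n - m$.

Adopted in LLaMA, GPT-NeoX, and others. Balances long-context support and performance.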
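
The following sketch checks numerically that rotating a query and a key by their positions gives a score that depends only on the position difference. It uses a single 2-D pair with one frequency `theta = 1.0`, which is a simplifying assumption; RoPE as published applies a different frequency to each pair of dimensions.

```python
import math

def rotate(vec, pos, theta=1.0):
    """Rotate a 2-D vector by the angle pos * theta (one frequency pair)."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x, y = vec
    return [c * x - s * y, s * x + c * y]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [1.0, 2.0], [0.5, -1.0]  # hypothetical query and key vectors

# Same offset (3) at two different absolute positions.
score_a = dot(rotate(q, 5), rotate(k, 2))
score_b = dot(rotate(q, 105), rotate(k, 102))
# A different offset (4) generally gives a different score.
score_c = dot(rotate(q, 5), rotate(k, 1))
```

Here `score_a` and `score_b` agree up to floating-point error, while `score_c` differs.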

Key papers:

Shaw et al. (2018) "Self-Attention with Relative Position Representations", NAACL
Press et al. (2022) "Train Short, Test Long: Attention with Linear Biases (ALiBi)", ICLR
Su et al. (2021) "RoFormer: Enhanced Transformer with Rotary Position Embedding", arXiv

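5. The Transformer Block

5.1 Feed-Forward Network (FFN)

$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

The original Transformer used ReLU as the activation; GELU is common in BERT- and GPT-style models.

Typically $d_{ff} = 4d$ (the hidden layer is four times wider than the model dimension).

Applied independently at each position (point-wise).

Interpretation: a non-linear transformation at each position. A block combines Attention (mixing across positions) with the FFN (transforming each position).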

5.2 Residual Connections and LayerNorm

Post-LN (original):

$$x = \text{LayerNorm}(x + \text{SubLayer}(x))$$

Pre-LN (GPT-2 onward):

$$x = x + \text{SubLayer}(\text{LayerNorm}(x))$$

Pre-LN trains more stably and is the current standard.
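
The difference between the two orderings is easiest to see side by side. In this sketch `layer_norm` omits the learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for attention or the FFN; both simplifications are assumptions made for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln(x, sublayer):
    # x = LayerNorm(x + SubLayer(x))
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer):
    # x = x + SubLayer(LayerNorm(x))
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(v):
    """Hypothetical sublayer: doubles each entry."""
    return [2.0 * t for t in v]

x = [1.0, 2.0, 3.0, 4.0]
y_post = post_ln(x, toy_sublayer)
y_pre = pre_ln(x, toy_sublayer)
```

In the Pre-LN form the input `x` passes through the residual path unnormalized, so there is an identity path from the block input to its output; Post-LN renormalizes that path in every block.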

Key papers:

Xiong et al. (2020) "On Layer Normalization in the Transformer Architecture", ICML

5.3 Encoder Block

Multi-Head Self-Attention
Residual connection + LayerNorm
Feed-Forward Network
Residual connection + LayerNorm

5.4 Decoder Block

Masked Multi-Head Self-Attention (cannot see the future)
Residual connection + LayerNorm
Cross-Attention (attends to the Encoder output)
Residual connection + LayerNorm
Feed-Forward Network
Residual connection + LayerNorm

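5.5 Causal Masking

For autoregressive generation, position $i$ must not attend to any position $j > i$.

Attention mask: $M_{ij} = -\infty$ if $j > i$, and $0$ otherwise. $M$ is added to the scores before the softmax, so masked positions receive zero weight.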
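
A minimal sketch of the mask and its effect on the softmax follows. The raw scores are set to zero (a hypothetical choice) so that the resulting weights show the mask pattern directly.

```python
import math

def causal_mask(n):
    """M[i][j] = 0 if j <= i, else -inf."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

n = 4
scores = [[0.0] * n for _ in range(n)]  # hypothetical uniform raw scores
M = causal_mask(n)
A = [softmax([s + b for s, b in zip(srow, mrow)])
     for srow, mrow in zip(scores, M)]
```

Row $i$ of `A` spreads its weight uniformly over positions $0, \ldots, i$ and assigns exactly zero to every later position.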

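6. Major Variants: BERT, GPT, T5

6.1 BERT (Encoder-only)

Devlin et al. (2019). Bidirectional context understanding.

Pre-training tasks:

Masked Language Model (MLM): mask 15% of the tokens and predict them
Next Sentence Prediction (NSP): decide whether two sentences are consecutive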
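
The MLM objective can be sketched as a data-preparation step. This simplified version always substitutes `[MASK]` at the selected positions; BERT's actual procedure also leaves some selected tokens unchanged or swaps in random tokens. The function name, the example sentence, and the seed are hypothetical.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Mask a random mask_rate fraction of tokens; return (inputs, labels)."""
    rng = random.Random(seed)
    num_to_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    inputs = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    # Labels hold the original token at masked positions and None elsewhere,
    # so the loss is computed only where the model has to predict.
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return inputs, labels

sentence = "the quick brown fox jumps over the lazy dog near the old river bank today"
tokens = sentence.split()
inputs, labels = mask_tokens(tokens)
```

Because the encoder sees the unmasked tokens on both sides of each `[MASK]`, the prediction uses bidirectional context.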

Uses: classification, NER, QA, sentence embeddings

Variants: RoBERTa, ALBERT, DistilBERT, DeBERTa

Key papers:

Devlin et al. (2019) "BERT: Pre-training of Deep Bidirectional Transformers", NAACL
Liu et al. (2019) "RoBERTa: A Robustly Optimized BERT Pretraining Approach", arXiv

6.2 GPT (Decoder-only)

Radford et al. (2018, 2019). Autoregressive language model.

Pre-training: next-token prediction (Causal LM)

$$P(x_t | x_1, \ldots, x_{t-1})$$

Progression:

GPT-2: 1.5B parameters, demonstrated the potential of zero-shot use
GPT-3: 175B parameters, Few-shot Learning, In-Context Learning
GPT-4: multimodal, RLHF

Key papers:

Radford et al. (2018) "Improving Language Understanding by Generative Pre-Training"
Brown et al. (2020) "Language Models are Few-Shot Learners (GPT-3)", NeurIPS

6.3 T5 (Encoder-Decoder)

Raffel et al. (2020). Casts every task as Text-to-Text.

Features:

Classification and generation share the same format
Relative positional encoding
C4 dataset

Variants: mT5 (multilingual), FLAN-T5 (Instruction Tuning)

Key papers:

Raffel et al. (2020) "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", JMLR

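7. Modern Improvements

7.1 Efficiency

FlashAttention (Dao et al., 2022)

Computes exact attention without materializing the full $n \times n$ attention matrix in GPU high-bandwidth memory, reducing memory use from quadratic to linear in the sequence length. IO-aware implementation.

Makes long-context processing practical.

Sparse Attention

Computes only a selected subset of position pairs instead of all pairs.

Longformer, BigBird.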
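
One simple sparsity pattern is a local window, where each position attends only to its neighbors. The sketch below builds such a pattern and counts the allowed pairs; it shows the local-window component only, whereas Longformer and BigBird combine it with further patterns. The sizes are illustrative.

```python
def sliding_window_mask(n, window):
    """allowed[i][j] is True when |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

n, window = 8, 1
allowed = sliding_window_mask(n, window)
num_pairs = sum(sum(row) for row in allowed)  # pairs actually computed
full_pairs = n * n                            # pairs in dense attention
```

With a fixed window the number of computed pairs grows linearly in $n$ (22 of 64 here), rather than quadratically.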

Key papers:

Dao et al. (2022) "FlashAttention: Fast and Memory-Efficient Exact Attention", NeurIPS
Beltagy et al. (2020) "Longformer: The Long-Document Transformer", arXiv

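7.2 Architectural Improvements

Grouped Query Attention (GQA)

Groups of query heads share a single K/V head, which shrinks the KV cache and improves inference efficiency. Adopted in LLaMA 2.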
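
The sharing scheme amounts to an index mapping from query heads to K/V heads. This sketch assumes contiguous grouping (consecutive query heads share one K/V head); the head counts are illustrative.

```python
def kv_head_for_query_head(num_q_heads, num_kv_heads):
    """Map each query head index to the K/V head index it shares."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return [q // group_size for q in range(num_q_heads)]

mapping = kv_head_for_query_head(num_q_heads=8, num_kv_heads=2)
# The KV cache stores one K and one V per K/V head, so compared with
# 8 separate K/V heads it shrinks by a factor of 8 / 2 = 4 here.
cache_reduction = 8 // 2
```

Setting `num_kv_heads` equal to `num_q_heads` recovers standard multi-head attention, and setting it to 1 gives multi-query attention.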

SwiGLU

Replaces the FFN activation with a gated variant.

$$\text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xV)W_2$$

Adopted in LLaMA and PaLM.
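
The SwiGLU formula can be sketched for a single input vector as below. Biases are omitted, and the small weight matrices are hypothetical values picked so the result is easy to check by hand.

```python
import math

def matvec(x, W):
    """Row vector x (length m) times matrix W (m x p)."""
    return [sum(a * b for a, b in zip(x, col)) for col in zip(*W)]

def swish(z):
    """Swish (SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu_ffn(x, W1, V, W2):
    gate = [swish(z) for z in matvec(x, W1)]       # Swish(x W1)
    value = matvec(x, V)                           # x V
    hidden = [g * v for g, v in zip(gate, value)]  # elementwise product
    return matvec(hidden, W2)

x = [1.0, -1.0]
W1 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
V = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = swiglu_ffn(x, W1, V, W2)
```

The gate `Swish(x W1)` decides, per hidden unit, how much of the linear branch `x V` passes through, which is what distinguishes a gated FFN from a plain two-layer one.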

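RMSNorm

A simplified LayerNorm that skips the mean computation: activations are divided by their root mean square without first subtracting the mean.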
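
A side-by-side sketch makes the simplification visible. Both functions omit the learned gain (and, for LayerNorm, the bias), which is an assumption made to keep the comparison short.

```python
import math

def rms_norm(x, eps=1e-6):
    """Divide by the root mean square; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def layer_norm(x, eps=1e-6):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
y_rms = rms_norm(x)
y_ln = layer_norm(x)
```

`y_ln` is centered at zero, while `y_rms` keeps the sign and relative offset of the input and only rescales it to unit root mean square.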

Key papers:

Ainslie et al. (2023) "GQA: Training Generalized Multi-Query Transformer Models", EMNLP
Shazeer (2020) "GLU Variants Improve Transformer", arXiv

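7.3 Long-Context Support

RoPE + NTK-aware scaling: extends the context length
Sliding Window Attention: local attention + global tokens
Ring Attention: long contexts through distributed processing across devices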

7.4 Recent LLM Architectures

Model	Key features
LLaMA 2/3	RoPE, GQA, SwiGLU, RMSNorm
Mistral	Sliding Window, GQA
Mixtral	Mixture of Experts
Claude	Constitutional AI (a training method rather than an architectural feature), long context

8. References

Essential papers

Vaswani et al. (2017) "Attention Is All You Need", NeurIPS
Devlin et al. (2019) "BERT", NAACL
Brown et al. (2020) "GPT-3", NeurIPS
Raffel et al. (2020) "T5", JMLR

Explanatory resources

Jay Alammar "The Illustrated Transformer"
Lilian Weng "The Transformer Family"
Harvard NLP "The Annotated Transformer"

Textbooks

Jurafsky & Martin (2024) "Speech and Language Processing", Chapter 10
Tunstall et al. (2022) "Natural Language Processing with Transformers", O'Reilly
