06. Supervised Learning - Introduction to AI - はとはとプロジェクト

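1. Problem Formulation

1.1 Classification

Map an input $x \in \mathcal{X}$ to a discrete label $y \in \{1, \ldots, K\}$.

Binary classification: $K=2$; e.g., spam detection, medical diagnosis
Multiclass classification: $K > 2$; e.g., image recognition, document classification
Multilabel classification: predict several labels for the same input at once

Loss functions: 0-1 loss, cross-entropy, hinge loss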
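The three losses listed above can be written per example as in the minimal sketch below. The function names are illustrative, class indices are assumed to be 0-based (unlike the $\{1, \ldots, K\}$ convention in the text), and the hinge loss assumes binary labels in $\{-1, +1\}$.

```python
import math

def zero_one_loss(y_true, y_pred):
    """0-1 loss: 1 if the predicted label is wrong, else 0."""
    return 0 if y_true == y_pred else 1

def cross_entropy_loss(y_true, probs):
    """Cross-entropy for one example; probs[k] is the predicted P(y=k | x)."""
    eps = 1e-12  # guard against log(0)
    return -math.log(max(probs[y_true], eps))

def hinge_loss(y_signed, score):
    """Hinge loss for a label in {-1, +1} and a real-valued score."""
    return max(0.0, 1.0 - y_signed * score)
```

The 0-1 loss is what is usually reported, but it is not differentiable, which is why cross-entropy and hinge loss tend to be used as training surrogates.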

1.2 Regression

Predict a continuous value $y \in \mathbb{R}$ from an input $x$.

Loss functions: squared error (MSE), absolute error (MAE), Huber loss
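As a rough illustration of how these losses differ, the sketch below implements each one per example. The threshold `delta=1.0` for the Huber loss is an assumed default, not a value from the text.

```python
def squared_error(y, y_hat):
    """Squared error; its mean over a dataset is the MSE."""
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    """Absolute error; its mean over a dataset is the MAE."""
    return abs(y - y_hat)

def huber_loss(y, y_hat, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

def mean_loss(loss, ys, y_hats):
    """Average a per-example loss over a dataset."""
    return sum(loss(y, p) for y, p in zip(ys, y_hats)) / len(ys)
```

The Huber loss behaves like squared error near zero and like absolute error in the tails, so a single outlier contributes linearly rather than quadratically.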

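1.3 Structured Prediction

Predict structured outputs such as sequences, trees, and graphs. Examples include sequence labeling (NER, POS tagging) and syntactic parsing.

Key papers: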

Lafferty et al. (2001) "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", ICML
Taskar et al. (2004) "Max-Margin Markov Networks", NIPS

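2. Linear Models

2.1 Linear Regression

$\hat{y} = w^T x + b$. Least squares has a closed-form solution when $X^T X$ is invertible (with the bias $b$ absorbed into $w$ by appending a constant 1 to each input):

$$w^* = (X^T X)^{-1} X^T y$$

Regularized variants:

Ridge regression: $w^* = (X^T X + \lambda I)^{-1} X^T y$
Lasso: $L_1$ regularization, yields sparse solutions (Tibshirani, 1996)
Elastic Net: combines $L_1$ and $L_2$ penalties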
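The closed-form least-squares and ridge solutions can be checked numerically. The sketch below is a dependency-free illustration: `solve` and `ridge_fit` are hypothetical helper names, the toy data is made up, and the bias is handled by a constant column, which means this simple version also penalizes the bias when `lam > 0`.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge_fit(X, y, lam=0.0):
    """Return w solving (X^T X + lam I) w = X^T y; lam=0 gives least squares."""
    d = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(gram, xty)

# Toy data generated by y = 2x + 1; the constant 1 column absorbs the bias b.
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [1.0, 3.0, 5.0, 7.0]
w_ols = ridge_fit(X, y)             # recovers slope 2 and intercept 1
w_ridge = ridge_fit(X, y, lam=1.0)  # coefficients shrink toward zero
```

Solving the linear system directly is generally preferable to forming the explicit inverse, even though the formula is written with $(\cdot)^{-1}$.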

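2.2 Logistic Regression

A linear model for classification. The sigmoid function turns the linear score into a probability.

$$P(y=1|x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

Parameters are fit by maximum likelihood, optimized with gradient descent or Newton's method.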
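To make the maximum-likelihood fit concrete, here is a minimal batch gradient descent sketch on the mean negative log-likelihood. The learning rate, epoch count, and toy data are assumptions chosen for illustration; labels are taken to be 0 or 1.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean negative log-likelihood."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the per-example NLL w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """P(y=1 | x) under the fitted model."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy one-dimensional data: small x is class 0, large x is class 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
```

On linearly separable data like this toy set, the unregularized weights keep growing with more epochs, which is one reason a penalty term is usually added in practice.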

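2.3 Generalized Linear Models (GLM)

A linear predictor combined with a link function, with the response distribution drawn from the exponential family.

Key papers: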

Nelder & Wedderburn (1972) "Generalized Linear Models", JRSS
McCullagh & Nelder (1989) "Generalized Linear Models", Chapman & Hall

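3. Kernel Methods and SVMs

3.1 The Kernel Trick

An implicit mapping into a high-dimensional feature space; everything the algorithm needs can be computed from inner products alone.

$$K(x, x') = \langle \phi(x), \phi(x') \rangle$$

Common kernels:

RBF (Gaussian): $K(x, x') = \exp(-\gamma \|x - x'\|^2)$
Polynomial: $K(x, x') = (x^T x' + c)^d$
Linear: $K(x, x') = x^T x'$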
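The identity $K(x, x') = \langle \phi(x), \phi(x') \rangle$ can be verified directly for a small case. The sketch below implements the three kernels and, as an illustrative assumption, an explicit feature map `phi_poly2` for the polynomial kernel with $c = 1$, $d = 2$ in two dimensions.

```python
import math

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, c=1.0, d=2):
    return (linear_kernel(x, z) + c) ** d

def rbf_kernel(x, z, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def phi_poly2(x):
    """Explicit feature map for (x^T z + 1)^2 with two-dimensional inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0]

x, z = [1.0, 2.0], [3.0, -1.0]
k_implicit = polynomial_kernel(x, z)                   # computed in 2 dimensions
k_explicit = linear_kernel(phi_poly2(x), phi_poly2(z)) # computed in 6 dimensions
```

Both values agree, but the kernel never builds the six-dimensional vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the implicit route is available.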

Key papers:

Aizerman et al. (1964) "Theoretical foundations of the potential function method"
Schölkopf & Smola (2002) "Learning with Kernels", MIT Press

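3.2 Support Vector Machines (SVM)

A maximum-margin classifier for labels $y_i \in \{-1, +1\}$. Only the support vectors determine the decision boundary.

Primal problem:

$$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$$
$$\text{s.t. } y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

Dual problem:

$$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
$$\text{s.t. } 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0$$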
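At the optimum of the primal problem each slack equals $\xi_i = \max(0, 1 - y_i(w^T x_i + b))$, so the objective can be minimized as a hinge-loss problem. The sketch below uses plain subgradient descent for the linear case; this is a simple stand-in for illustration, not the SMO or quadratic-programming solvers used by production libraries, and the step size, epoch count, and toy data are assumptions.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=1000):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of 0.5*||w||^2 is w
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # the hinge term is active for this example
                for j in range(d):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def decision_function(w, b, x):
    """Signed score; its sign is the predicted label."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Toy linearly separable data with labels in {-1, +1}.
X = [[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]]
y = [-1, -1, 1, 1]
w, b = train_linear_svm(X, y)
```

With a constant step size the iterates hover around the optimum rather than converging exactly; a decaying step size would tighten this at the cost of more tuning.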

Key papers:

Cortes & Vapnik (1995) "Support-Vector Networks", Machine Learning
Vapnik (1998) "Statistical Learning Theory", Wiley
Platt (1998) "Sequential Minimal Optimization", Microsoft Research

3.3 Theory of Kernel Methods

Reproducing kernel Hilbert spaces (RKHS), the representer theorem, and generalization theory.

Key papers:

Aronszajn (1950) "Theory of Reproducing Kernels", Trans. AMS
Wahba (1990) "Spline Models for Observational Data", SIAM

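4. Decision Trees and Ensembles

4.1 Decision Trees

Recursive partitioning of the feature space. Interpretable, but prone to overfitting.

Splitting criteria: information gain, Gini impurity, variance reduction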
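The splitting criteria above all measure how much a split reduces an impurity function. A minimal sketch, with illustrative function names: passing `entropy` to `split_gain` gives information gain, `gini` gives the Gini-based gain, and `variance` gives variance reduction for regression targets.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a label list."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def variance(values):
    """Population variance, the impurity used for regression trees."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_gain(parent, left, right, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

parent = ["a", "a", "b", "b"]
perfect_gain = split_gain(parent, ["a", "a"], ["b", "b"], entropy)
useless_gain = split_gain(parent, ["a", "b"], ["a", "b"], entropy)
```

A tree learner evaluates this gain for every candidate split and greedily picks the largest one at each node.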

Algorithms: ID3, C4.5, CART

Key papers:

Quinlan (1986) "Induction of Decision Trees", Machine Learning
Breiman et al. (1984) "Classification and Regression Trees", CRC Press

4.2 Random Forests

Bagging plus random subsampling of features. Trees can be trained in parallel, and the ensemble is resistant to overfitting.

Key papers:

Breiman (2001) "Random Forests", Machine Learning

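4.3 Gradient Boosting

Adds weak learners sequentially, each fit to the residuals (more generally, the negative gradient of the loss) of the current ensemble. Highly accurate, but careful hyperparameter tuning matters.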
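The residual-fitting loop can be sketched with regression stumps on a one-dimensional feature. This is a toy illustration under squared loss, where the negative gradient is exactly the residual; the number of rounds, learning rate, and data are assumptions, and real implementations such as XGBoost or LightGBM add many refinements not shown here.

```python
def fit_stump(x, residuals):
    """Best single-threshold regression stump (assumes >= 2 distinct x values)."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Boosting for squared loss: each stump is fit to the current residuals."""
    base = sum(y) / len(y)  # initial constant prediction
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

def mse(f, x, y):
    return sum((f(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.8, 0.9, 0.1, -0.8, -1.0]
model = gradient_boost(x, y)
baseline = lambda xi: sum(y) / len(y)  # constant-mean predictor for comparison
```

The learning rate shrinks each stump's contribution; smaller values typically need more rounds but tend to generalize better, which is part of why tuning matters.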

Implementations: XGBoost, LightGBM, CatBoost

Key papers:

Friedman (2001) "Greedy Function Approximation: A Gradient Boosting Machine", Annals of Statistics
Chen & Guestrin (2016) "XGBoost: A Scalable Tree Boosting System", KDD
Ke et al. (2017) "LightGBM: A Highly Efficient Gradient Boosting Decision Tree", NIPS

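4.4 Ensemble Theory

Bagging (mainly reduces variance), boosting (mainly reduces bias), and stacking (trains a meta-learner on the base models' predictions).

Key papers: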

Breiman (1996) "Bagging Predictors", Machine Learning
Freund & Schapire (1997) "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting", JCSS
Wolpert (1992) "Stacked Generalization", Neural Networks

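5. Neural Networks

5.1 Multilayer Perceptron (MLP)

A general-purpose function approximator, as established by the universal approximation theorem (Cybenko, 1989; Hornik et al., 1989).

→ For details, see 03. Theoretical Foundations of Neural Networks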

5.2 Convolutional Neural Networks (CNN)

The standard approach to image recognition, built on local connectivity, weight sharing, and pooling.
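The three ingredients can be seen in a few lines. In the sketch below, each output value depends only on a small patch (local connectivity), the same kernel is reused at every position (weight sharing), and max pooling downsamples the result. The image and the edge-detecting kernel are made-up examples.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation, the operation CNNs call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# 5x5 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
edge_kernel = [[-1, 1], [-1, 1]]       # responds to a left-to-right increase
features = conv2d(image, edge_kernel)  # 4x4 feature map
pooled = max_pool2d(features)          # 2x2 map after pooling
```

The feature map is nonzero only in the column where the edge sits, and pooling keeps that response while halving the resolution.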

Key papers:

LeCun et al. (1998) "Gradient-based learning applied to document recognition", Proceedings of the IEEE
Krizhevsky et al. (2012) "ImageNet Classification with Deep Convolutional Neural Networks", NIPS
He et al. (2016) "Deep Residual Learning for Image Recognition", CVPR

5.3 Transformer

The standard architecture for natural language processing, built on the self-attention mechanism. It has also been extended to vision and speech.
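As a rough sketch of what self-attention computes, the code below implements single-head scaled dot-product attention: each position forms a query, compares it with every key, and returns a softmax-weighted average of the values. The projection matrices `Wq`, `Wk`, `Wv` are set to the identity purely for illustration; in a trained model they are learned.

```python
import math

def softmax(scores):
    """Softmax with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X (n x d)."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    weights = [softmax([sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d_k)
                        for kj in K])
               for qi in Q]
    return matmul(weights, V), weights

# Three tokens with two-dimensional embeddings (made-up values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(X, identity, identity, identity)
```

Each row of `weights` sums to 1 and says how much one token attends to every other token, so the output for a position mixes information from the whole sequence in one step.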

→ For details, see The Transformer Revolution

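6. Probabilistic Models

6.1 Naive Bayes

A classifier based on the assumption that features are conditionally independent given the class. Simple, fast, and often robust in practice.

$$P(y|x) \propto P(y) \prod_j P(x_j|y)$$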
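The factorization above translates almost directly into code. The sketch below is a categorical naive Bayes classifier with Laplace smoothing; the smoothing constant `alpha=1.0`, the function names, and the weather-style toy data are assumptions for illustration. Scores are accumulated in log space to avoid underflow.

```python
import math
from collections import Counter

def fit_naive_bayes(X, y, alpha=1.0):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(y)
    n_features = len(X[0])
    # value_counts[c][j][v] = number of class-c examples whose feature j equals v
    value_counts = {c: [Counter() for _ in range(n_features)] for c in class_counts}
    vocab = [set() for _ in range(n_features)]
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            value_counts[yi][j][v] += 1
            vocab[j].add(v)
    return class_counts, value_counts, vocab, alpha

def predict_log_scores(model, x):
    """Unnormalized log P(y) + sum_j log P(x_j | y) for every class."""
    class_counts, value_counts, vocab, alpha = model
    n = sum(class_counts.values())
    scores = {}
    for c, nc in class_counts.items():
        s = math.log(nc / n)
        for j, v in enumerate(x):
            s += math.log((value_counts[c][j][v] + alpha)
                          / (nc + alpha * len(vocab[j])))
        scores[c] = s
    return scores

def predict(model, x):
    scores = predict_log_scores(model, x)
    return max(scores, key=scores.get)

X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
y = ["no", "no", "yes", "yes"]
model = fit_naive_bayes(X, y)
```

Training is a single counting pass over the data, which is where the speed comes from; smoothing keeps an unseen feature value from driving a class probability to zero.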

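6.2 Gaussian Processes (GP)

A prior distribution over functions. Enables uncertainty quantification; prior knowledge is encoded through the kernel function.

Key papers: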

Rasmussen & Williams (2006) "Gaussian Processes for Machine Learning", MIT Press

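6.3 Bayesian Neural Networks

Treat the weights as random variables and infer their posterior distribution, which yields uncertainty estimates. Exact inference is intractable, so approximate inference is required (variational inference, MC dropout).

Key papers: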

Neal (1996) "Bayesian Learning for Neural Networks", Springer
Gal & Ghahramani (2016) "Dropout as a Bayesian Approximation", ICML
Blundell et al. (2015) "Weight Uncertainty in Neural Networks", ICML

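7. Guidelines for Choosing a Method

7.1 Data Size and Dimensionality

Small data, low dimension: SVM, random forests, gradient boosting
Large data, high dimension: neural networks, linear models
Structured (tabular) data: gradient boosting is a strong choice
Unstructured data (images, text): deep learning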

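7.2 Interpretability Requirements

High interpretability: linear models, decision trees, rule-based models
Post-hoc interpretation: SHAP, LIME, attention visualization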

7.3 Uncertainty Quantification

When needed: Gaussian processes, Bayesian neural networks, ensembles
Conformal Prediction: distribution-free prediction intervals
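One common variant, split conformal prediction, wraps any pre-trained regressor using a held-out calibration set. The sketch below assumes absolute residuals as the conformity score; the `predict` function and the calibration data are hypothetical placeholders.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def split_conformal_interval(predict, X_cal, y_cal, x_new, alpha=0.1):
    """Prediction interval around predict(x_new) from calibration residuals."""
    scores = [abs(y - predict(x)) for x, y in zip(X_cal, y_cal)]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q

def predict(x):
    """Hypothetical pre-trained model, used here only for illustration."""
    return 2.0 * x

# Calibration data: the model's output plus made-up noise.
X_cal = [float(i) for i in range(1, 10)]
noise = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9]
y_cal = [2.0 * x + r for x, r in zip(X_cal, noise)]
lo, hi = split_conformal_interval(predict, X_cal, y_cal, x_new=5.0, alpha=0.25)
```

If the calibration and test points are exchangeable, the interval covers the true value with probability at least $1 - \alpha$, regardless of the data distribution or how good the underlying model is; a poor model simply produces wider intervals.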

Key papers:

Lundberg & Lee (2017) "A Unified Approach to Interpreting Model Predictions (SHAP)", NIPS
Ribeiro et al. (2016) "Why Should I Trust You? Explaining the Predictions of Any Classifier (LIME)", KDD

8. References

Textbooks

Hastie, Tibshirani & Friedman (2009) "The Elements of Statistical Learning", Springer
Bishop (2006) "Pattern Recognition and Machine Learning", Springer
Murphy (2012) "Machine Learning: A Probabilistic Perspective", MIT Press

Surveys and Tutorials

Fernández-Delgado et al. (2014) "Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?", JMLR
Caruana & Niculescu-Mizil (2006) "An Empirical Comparison of Supervised Learning Algorithms", ICML

Benchmarks

OpenML: an open machine learning platform
UCI Machine Learning Repository: a collection of classic datasets
Kaggle: competitions and real-world datasets