05. Fundamental Concepts of Machine Learning - Introduction to AI - はとはとプロジェクト

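1. Formulating the Learning Problem

1.1 Empirical Risk Minimization (ERM)

The basic framework of machine learning. Given samples $\{(x_i, y_i)\}_{i=1}^n$ drawn from an unknown distribution $P(X, Y)$, the goal is to find a hypothesis $h$ that minimizes the expected loss (risk). Because $P$ is unknown, the true risk cannot be computed directly, so ERM minimizes the empirical risk over the samples as a proxy.

True risk: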

$$R(h) = \mathbb{E}_{(X,Y) \sim P}[\ell(h(X), Y)]$$

Empirical risk:

$$\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(x_i), y_i)$$
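As a minimal sketch of the empirical risk formula, the following computes $\hat{R}_n(h)$ for a few candidate hypotheses and picks the minimizer. The squared loss, the toy samples, and the three candidate functions are illustrative assumptions, not part of the original text.

```python
def squared_loss(prediction, target):
    """Squared loss l(h(x), y) = (h(x) - y)^2 (an assumed choice of loss)."""
    return (prediction - target) ** 2


def empirical_risk(h, samples, loss=squared_loss):
    """Average loss of hypothesis h over the observed samples."""
    return sum(loss(h(x), y) for x, y in samples) / len(samples)


# Hypothetical toy data generated by y = 2x.
samples = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# A tiny, hand-picked hypothesis class.
candidates = {
    "h(x) = x": lambda x: x,
    "h(x) = 2x": lambda x: 2.0 * x,
    "h(x) = 3x": lambda x: 3.0 * x,
}

# ERM: evaluate the empirical risk of every candidate and keep the smallest.
risks = {name: empirical_risk(h, samples) for name, h in candidates.items()}
best = min(risks, key=risks.get)
print(risks, best)
```

In practice the hypothesis class is far larger and the minimization is done by an optimizer rather than by enumeration.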

Key papers:

Vapnik (1991) "Principles of Risk Minimization for Learning Theory", NIPS
Vapnik (1998) "Statistical Learning Theory", Wiley

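1.2 The i.i.d. Assumption and Its Relaxations

Classical ML assumes that data are independent and identically distributed (i.i.d.). In practice, distribution shift, temporal dependence, and selection bias violate this assumption. Domain adaptation, continual learning, and causal inference are the main approaches for handling these violations.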
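One common way to handle covariate shift is to reweight each training loss by the density ratio $p_{\text{test}}(x) / p_{\text{train}}(x)$. The sketch below illustrates that idea under strong assumptions: both input densities are known Gaussians (training inputs centered at 0, test inputs centered at 1), and the samples and hypothesis are made up for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weight(x, train_mean=0.0, test_mean=1.0, std=1.0):
    """Density ratio p_test(x) / p_train(x); both densities are assumed known."""
    return gaussian_pdf(x, test_mean, std) / gaussian_pdf(x, train_mean, std)


def weighted_empirical_risk(h, samples, weight):
    """Empirical squared-error risk with a per-sample weight w(x)."""
    return sum(weight(x) * (h(x) - y) ** 2 for x, y in samples) / len(samples)


# Hypothetical training samples and hypothesis.
samples = [(-1.0, -1.0), (0.0, 0.5), (0.5, 0.5), (2.0, 1.0)]
h = lambda x: x

plain_risk = weighted_empirical_risk(h, samples, lambda x: 1.0)
shifted_risk = weighted_empirical_risk(h, samples, importance_weight)
print(plain_risk, shifted_risk)
```

Here the error at $x = 2$ counts more under the shifted distribution, so the weighted risk is larger than the unweighted one. In real problems the density ratio is unknown and has to be estimated.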

Key papers:

Shimodaira (2000) "Improving predictive inference under covariate shift", Journal of Statistical Planning and Inference
Ben-David et al. (2010) "A theory of learning from different domains", Machine Learning

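2. Data Splitting and Evaluation

2.1 Train/Validation/Test Split

The standard approach splits the data into three parts: the training set is used to fit the model, the validation set to tune hyperparameters, and the test set for the final evaluation. Preventing information leakage is essential: nothing derived from the validation or test data (for example, normalization statistics) should influence training.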

Training set (60-80%): parameter learning
Validation set (10-20%): model selection, early stopping
Test set (10-20%): final evaluation of generalization performance
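A three-way split can be sketched as follows. The 70/15/15 proportions, the function name `train_val_test_split`, and the fixed seed are assumptions chosen for illustration.

```python
import random


def train_val_test_split(data, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut the shuffled indices into three disjoint parts."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)

    n_test = round(len(data) * test_fraction)
    n_val = round(len(data) * val_fraction)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    def pick(idx):
        return [data[i] for i in idx]

    return pick(train_idx), pick(val_idx), pick(test_idx)


data = list(range(100))
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

A random shuffle like this is only appropriate when the samples are exchangeable. For time-ordered data the split should follow temporal order instead.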

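2.2 Cross-Validation

Split the data into $k$ folds and use each fold in turn as the held-out set, training on the remaining $k-1$ folds. The spread of the $k$ scores gives an estimate of the variance of the performance estimate. Larger $k$ uses more data per fit but costs more computation, so $k$ trades off computational cost against the quality of the estimate.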

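Variants:

k-fold CV: the standard method ($k=5$ or $10$ is common)
Leave-One-Out (LOO): $k=n$; low bias but high variance
Stratified CV: preserves class proportions in each fold
Time Series CV: respects temporal order (training data always precede the held-out data)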
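The plain k-fold procedure can be sketched in a few lines. The "model" here is a deliberately trivial mean predictor, and the data are synthetic; both are assumptions made only to keep the example self-contained.

```python
import random
import statistics


def k_fold_splits(n_samples, k, seed=0):
    """Return k (train_indices, held_out_indices) pairs covering every sample once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out_id in range(k):
        held_out = folds[held_out_id]
        train = [i for fold_id, fold in enumerate(folds)
                 if fold_id != held_out_id for i in fold]
        splits.append((train, held_out))
    return splits


def cross_validate_mean_predictor(ys, k=5):
    """k-fold CV of a predictor that always outputs the training mean."""
    fold_errors = []
    for train_idx, held_out_idx in k_fold_splits(len(ys), k):
        prediction = statistics.fmean([ys[i] for i in train_idx])
        mse = statistics.fmean([(ys[i] - prediction) ** 2 for i in held_out_idx])
        fold_errors.append(mse)
    # Mean of the fold scores, plus their spread as a rough variability estimate.
    return statistics.fmean(fold_errors), statistics.stdev(fold_errors)


ys = [float(i % 7) for i in range(50)]  # synthetic targets
mean_mse, std_mse = cross_validate_mean_predictor(ys, k=5)
print(mean_mse, std_mse)
```

Stratified and time-series variants differ only in how the folds are built: the former balances class labels across folds, the latter keeps held-out samples later in time than the training samples.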

Key papers:

Stone (1974) "Cross-Validatory Choice and Assessment of Statistical Predictions", JRSS
Arlot & Celisse (2010) "A survey of cross-validation procedures for model selection", Statistics Surveys

2.3 Evaluation Metrics

Classification: accuracy, precision, recall, F1, AUC-ROC, log loss

Regression: MSE, MAE, $R^2$, MAPE

Ranking: NDCG, MAP, MRR
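As an illustration of the classification metrics, the sketch below derives accuracy, precision, recall, and F1 from the confusion-matrix counts of a binary problem. The labels are made up, and returning 0.0 for an undefined ratio is an assumed convention.

```python
def binary_classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_classification_metrics(y_true, y_pred)
print(metrics)
```

With imbalanced classes, accuracy alone can be misleading, which is why precision, recall, and F1 are usually reported alongside it.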

Key papers:

Hand & Till (2001) "A Simple Generalisation of the Area Under the ROC Curve", Machine Learning

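3. Overfitting and Regularization

3.1 Overfitting

The phenomenon in which a model fits the training data too closely and its generalization performance degrades. It depends on model complexity, the amount of data, and the noise level.

Symptom: training error is low but test error is high (the generalization gap).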

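3.2 Regularization Methods

Parameter regularization:

$L_2$ regularization (Ridge / Weight Decay): $\lambda \|w\|_2^2$
$L_1$ regularization (Lasso): $\lambda \|w\|_1$; induces sparse solutions
Elastic Net: a combination of the $L_1$ and $L_2$ penalties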
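The difference between the two penalties is easiest to see in one dimension, where both have closed-form solutions. The sketch below assumes a single weight $w$, no intercept, and the objectives $\sum_i (y_i - w x_i)^2 + \lambda w^2$ and $\sum_i (y_i - w x_i)^2 + \lambda |w|$; the data are made up.

```python
def ridge_1d(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping at exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_1d(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * |w| for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam / 2.0) / sxx


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # unregularized solution is w = 2
for lam in (0.0, 14.0, 28.0, 56.0):
    print(lam, ridge_1d(xs, ys, lam), lasso_1d(xs, ys, lam))
```

Ridge shrinks the weight smoothly but never reaches zero for finite $\lambda$, while Lasso sets it to exactly zero once $\lambda$ is large enough. This is the one-dimensional version of the sparsity effect.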

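Implicit regularization (techniques that constrain the model without an explicit penalty term on the parameters):

Early Stopping: stop training when the validation error stops improving
Dropout: randomly disables units during training; has an ensemble-like effect
Data augmentation: artificially expands the training data
Batch normalization: improves optimization and also has a regularizing effect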
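Early stopping is typically implemented with a patience counter. The sketch below assumes a precomputed list of per-epoch validation losses (hypothetical numbers) in place of a real training loop.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, best_loss), stopping after `patience` epochs without improvement."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it (a real loop would also checkpoint the model).
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop training.
            break
    return best_epoch, best_loss


# Hypothetical validation curve: improves, then degrades.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.6]
result = early_stopping_epoch(val_losses, patience=2)
print(result)
```

Training stops at epoch 4, so the later improvement at epoch 5 is never observed. A larger `patience` reduces that risk at the cost of more computation.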

Key papers:

Tibshirani (1996) "Regression Shrinkage and Selection via the Lasso", JRSS
Srivastava et al. (2014) "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", JMLR
Ioffe & Szegedy (2015) "Batch Normalization", ICML

4. Bias-Variance Decomposition

4.1 The Classical Decomposition

Decomposition of the expected squared error (regression case):

$$\mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

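Bias: systematic error caused by the model's assumptions
Variance: sensitivity to fluctuations in the training data
Irreducible error: noise inherent in the data

Trade-off: complex models tend toward low bias and high variance; simple models tend toward high bias and low variance.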
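The trade-off can be checked numerically by refitting a model on many resampled training sets and measuring how its prediction at one point behaves. The sketch below assumes a true function $f(x) = x^2$, Gaussian noise, a fixed input grid, and a k-nearest-neighbor regressor whose complexity is controlled by `k`; all of these are illustrative choices.

```python
import random
import statistics


def true_function(x):
    return x * x


def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training inputs closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return statistics.fmean([ys[i] for i in nearest])


def bias_variance_at(x0, k, trials=2000, noise_std=0.3, seed=0):
    """Estimate squared bias and variance of the k-NN prediction at x0."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(11)]  # fixed inputs 0.0, 0.1, ..., 1.0
    predictions = []
    for _ in range(trials):
        # Draw a fresh noisy training set and refit (k-NN needs no explicit fit).
        ys = [true_function(x) + rng.gauss(0.0, noise_std) for x in xs]
        predictions.append(knn_predict(xs, ys, x0, k))
    mean_prediction = statistics.fmean(predictions)
    bias_sq = (mean_prediction - true_function(x0)) ** 2
    variance = statistics.pvariance(predictions)
    return bias_sq, variance


bias_complex, var_complex = bias_variance_at(1.0, k=1)   # flexible model
bias_simple, var_simple = bias_variance_at(1.0, k=11)    # predicts the global mean
print(bias_complex, var_complex)
print(bias_simple, var_simple)
```

With `k=1` the prediction tracks a single noisy target (low bias, high variance). With `k=11` it averages every target (high bias at $x_0 = 1$, low variance).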

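4.2 Reinterpretation in the Deep Learning Era

In overparameterized models, the "Double Descent" phenomenon has prompted a re-examination of the classical trade-off: test error can rise near the interpolation threshold and then fall again as model size keeps growing.

Key papers: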

Geman et al. (1992) "Neural Networks and the Bias/Variance Dilemma", Neural Computation
Belkin et al. (2019) "Reconciling modern machine-learning practice and the classical bias–variance trade-off", PNAS
Nakkiran et al. (2020) "Deep Double Descent: Where Bigger Models and More Data Can Hurt", ICLR

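4.3 Benign Overfitting

The phenomenon in which an interpolating learner (zero training error) still generalizes well. It reflects a newer understanding of learning in high-dimensional settings.

Key papers: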

Bartlett et al. (2020) "Benign overfitting in linear regression", PNAS
Hastie et al. (2022) "Surprises in High-Dimensional Ridgeless Least Squares Interpolation", Annals of Statistics

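5. Model Selection

5.1 Information Criteria

These criteria balance likelihood against model complexity and can serve as an approximation to cross-validation. Here $L$ is the maximized likelihood, $k$ the number of parameters, and $n$ the sample size; lower values are better.

AIC (Akaike Information Criterion): $-2\log L + 2k$
BIC (Bayesian Information Criterion): $-2\log L + k\log n$
MDL (Minimum Description Length): an approximation to Kolmogorov complexity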
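The two formulas can be compared directly. In the sketch below the log-likelihoods, parameter counts, and sample size are hypothetical numbers chosen so that the two criteria disagree.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 log L + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: -2 log L + k log n."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)


n_samples = 100
# Hypothetical fitted models: the larger one fits slightly better.
models = {
    "small": {"log_likelihood": -120.0, "n_params": 3},
    "large": {"log_likelihood": -116.0, "n_params": 5},
}

aic_scores = {name: aic(m["log_likelihood"], m["n_params"])
              for name, m in models.items()}
bic_scores = {name: bic(m["log_likelihood"], m["n_params"], n_samples)
              for name, m in models.items()}

best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
print(aic_scores, best_by_aic)
print(bic_scores, best_by_bic)
```

Because $\log n > 2$ once $n \geq 8$, BIC penalizes each extra parameter more heavily than AIC and here prefers the smaller model.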

Key papers:

Akaike (1974) "A new look at the statistical model identification", IEEE Trans. Automatic Control
Schwarz (1978) "Estimating the Dimension of a Model", Annals of Statistics
Rissanen (1978) "Modeling by shortest data description", Automatica

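5.2 Hyperparameter Optimization

Grid search: exhaustive search over a predefined grid
Random search: efficient in high dimensions (Bergstra & Bengio, 2012)
Bayesian optimization: efficient search guided by a Gaussian process surrogate
Hyperband / ASHA: speed-up through early termination of poor configurations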
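Random search is simple enough to sketch directly. The function `validation_loss` below is a hypothetical stand-in for "train a model and return its validation loss", and the log-uniform search ranges are assumptions.

```python
import math
import random


def validation_loss(learning_rate, weight_decay):
    """Hypothetical objective with its optimum at lr = 1e-2, wd = 1e-4."""
    return ((math.log10(learning_rate) + 2.0) ** 2
            + (math.log10(weight_decay) + 4.0) ** 2)


def random_search(objective, n_trials=50, seed=0):
    """Sample hyperparameters log-uniformly and keep the best configuration."""
    rng = random.Random(seed)
    best_loss, best_params = float("inf"), None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-5.0, 0.0),
            "weight_decay": 10 ** rng.uniform(-6.0, -2.0),
        }
        loss = objective(**params)
        if loss < best_loss:
            best_loss, best_params = loss, params
    return best_loss, best_params


best_loss, best_params = random_search(validation_loss, n_trials=50)
print(best_loss, best_params)
```

Sampling on a log scale is a common choice for parameters such as learning rates that span several orders of magnitude. Unlike grid search, every trial here tries a new value of each hyperparameter.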

Key papers:

Bergstra & Bengio (2012) "Random Search for Hyper-Parameter Optimization", JMLR
Snoek et al. (2012) "Practical Bayesian Optimization of Machine Learning Algorithms", NIPS
Li et al. (2018) "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization", JMLR

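6. Feature Engineering and Representation Learning

6.1 Classical Feature Engineering

Manual feature design based on domain knowledge: normalization, encoding, interaction terms, and polynomial features.

Techniques:

Scaling (StandardScaler, MinMaxScaler)
Categorical encoding (One-hot, Target encoding)
Missing-value handling (Imputation)
Dimensionality reduction (PCA, t-SNE, UMAP)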
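Two of these techniques, standardization and one-hot encoding, can be sketched with the standard library alone. The helper names and toy values below are assumptions for illustration; they mimic the fit-then-transform pattern rather than reproduce any library's API.

```python
import statistics


def fit_standardizer(train_values):
    """Compute mean and standard deviation from the training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def standardize(values, mean, std):
    """Apply previously fitted statistics: (v - mean) / std."""
    return [(v - mean) / std for v in values]


def one_hot(values, vocabulary):
    """Encode each value as a 0/1 vector over a fixed, ordered vocabulary."""
    return [[1 if v == category else 0 for category in vocabulary]
            for v in values]


# Scaling: statistics come from the training split and are reused elsewhere.
train_heights = [150.0, 160.0, 170.0, 180.0, 190.0]
mean, std = fit_standardizer(train_heights)
scaled_train = standardize(train_heights, mean, std)
scaled_new = standardize([170.0, 200.0], mean, std)

# Categorical encoding: the vocabulary is also fixed from the training split.
colors = ["red", "green", "red"]
vocabulary = sorted(set(colors))
encoded = one_hot(colors, vocabulary)
print(scaled_new, vocabulary, encoded)
```

Fitting the statistics and the vocabulary on the training split alone keeps validation and test information out of training. In this sketch a category unseen during fitting encodes to an all-zero vector.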

6.2 The Shift to Representation Learning

Automatic feature extraction by deep learning. End-to-end learning. The pre-training + fine-tuning paradigm.

Key papers:

Bengio et al. (2013) "Representation Learning: A Review and New Perspectives", IEEE TPAMI

6.3 Transfer Learning and Pre-training

Reuse of pre-trained models, for example ImageNet pre-training and language model pre-training (BERT, GPT).

Key papers:

Yosinski et al. (2014) "How transferable are features in deep neural networks?", NIPS
Howard & Ruder (2018) "Universal Language Model Fine-tuning for Text Classification", ACL

7. References

Textbooks

Hastie, Tibshirani & Friedman (2009) "The Elements of Statistical Learning", 2nd ed., Springer
Bishop (2006) "Pattern Recognition and Machine Learning", Springer
Murphy (2022) "Probabilistic Machine Learning: An Introduction", MIT Press
Shalev-Shwartz & Ben-David (2014) "Understanding Machine Learning: From Theory to Algorithms", Cambridge

Surveys

Domingos (2012) "A Few Useful Things to Know About Machine Learning", CACM
Goodfellow et al. (2016) "Deep Learning", MIT Press (Chapter 5: Machine Learning Basics)