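Chapter 8: Performance Optimization

Updated: December 9, 2025

This chapter covers performance optimization techniques for Python applications: profiling with cProfile and py-spy, parallel processing with multiprocessing and concurrent.futures, compiled acceleration with Cython and Numba, GPU computing with CuPy and PyTorch, and techniques for reducing memory use. Following the principle "measure, don't guess," identify the bottleneck first and optimize only afterward.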

1. Profiling

The first step in optimization is identifying the bottleneck. Base every optimization on measurements, not guesses [1].
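
Before reaching for a full profiler, a quick timing comparison is often enough to check a hunch. The following is a minimal sketch using the standard library's timeit module; the two functions and the helper `best_time` are hypothetical examples introduced here for illustration, not part of the chapter's code.

```python
import timeit

def squares_loop(n):
    """Build a list of squares with an explicit loop."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result

def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]

def best_time(func, n=10_000, repeat=5, number=20):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    # The minimum is the least disturbed by other activity on the machine.
    return min(timings) / number

if __name__ == "__main__":
    for func in (squares_loop, squares_comprehension):
        print(f"{func.__name__}: {best_time(func) * 1e6:.1f} us per call")
```

Taking the minimum over several repeats reduces noise from other processes; the absolute numbers still depend on the machine and Python version.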

1.1 cProfile

cProfile is the profiler in Python's standard library. It collects statistics on function calls.

import cProfile
import pstats
from io import StringIO

def slow_function():
    total = 0
    for i in range(1000000):
        total += i ** 2
    return total

def main():
    for _ in range(10):
        slow_function()

# Run the profiler
profiler = cProfile.Profile()
profiler.enable()

main()

profiler.disable()

# Show the results (top 20 entries, sorted by cumulative time)
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)

# Run from the command line
# python -m cProfile -s cumulative script.py

# Example output:
#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#        10    2.345    0.234    2.345    0.234 script.py:5(slow_function)
#         1    0.001    0.001    2.346    2.346 script.py:11(main)

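Table 1. Reading cProfile output

Column	Meaning
ncalls	Number of calls
tottime	Time spent in the function itself (excluding callees)
cumtime	Cumulative time (including callees)
percall	Time per call (first column: tottime per call; second column: cumtime per primitive call)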
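
The difference between tottime and cumtime is easiest to see by sorting the same profile both ways. The sketch below captures the report into a string with io.StringIO instead of printing it directly; `inner`, `outer`, and `report` are hypothetical names used only for this illustration.

```python
import cProfile
import io
import pstats

def inner(n):
    """Does the actual work, so it accumulates tottime."""
    return sum(i * i for i in range(n))

def outer():
    """Mostly delegates, so its cumtime is high but its tottime is low."""
    return [inner(20_000) for _ in range(20)]

profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

def report(sort_key, limit=10):
    """Return up to `limit` rows of the profile, sorted by a pstats key."""
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats(sort_key).print_stats(limit)
    return buffer.getvalue()

by_cumulative = report(pstats.SortKey.CUMULATIVE)  # callers near the top
by_tottime = report(pstats.SortKey.TIME)           # hot spots near the top
print(by_tottime)
```

Sorting by cumulative time shows which call paths are expensive overall, while sorting by tottime points at the functions where the time is actually spent.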

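1.2 py-spy

py-spy is a sampling profiler. It inspects the target process from outside rather than instrumenting the code, so its overhead is low enough for use in production [2].

# Install
# pip install py-spy

# Profile a running process
# py-spy top --pid 12345

# Generate a flame graph
# py-spy record -o profile.svg -- python script.py

# Process inside a Docker container
# (assumes py-spy is installed in the container, the target runs as PID 1,
#  and the container was started with the SYS_PTRACE capability,
#  e.g. docker run --cap-add SYS_PTRACE ...)
# docker exec -it container_name py-spy top --pid 1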
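
To make the idea of sampling concrete, here is a toy in-process sampler. It is only an illustration of the principle: py-spy itself runs as a separate process and reads the target's memory, whereas this sketch uses a background thread and the CPython-specific `sys._current_frames()`. The names `busy` and `sample_thread` are hypothetical.

```python
import collections
import sys
import threading
import time

def busy(duration):
    """Burn CPU for roughly `duration` seconds."""
    end = time.perf_counter() + duration
    total = 0
    while time.perf_counter() < end:
        total += 1
    return total

def sample_thread(thread_id, counts, stop, interval=0.005):
    """Periodically record which function the target thread is executing."""
    while not stop.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)

counts = collections.Counter()
stop = threading.Event()
sampler = threading.Thread(
    target=sample_thread,
    args=(threading.get_ident(), counts, stop),
)
sampler.start()
busy(0.3)       # the workload being observed
stop.set()
sampler.join()

# Functions that were on the stack top most often are the likely hot spots.
print(counts.most_common(3))
```

The workload is never modified or instrumented; the sampler simply looks at it at intervals, which is why sampling profilers can keep overhead low.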

1.3 line_profiler

line_profiler measures execution time line by line, which helps pinpoint the exact lines responsible for a bottleneck.

# pip install line_profiler

# Importing the decorator explicitly requires line_profiler 4.1 or later
from line_profiler import profile

@profile
def compute_heavy():
    result = []
    for i in range(10000):
        result.append(i ** 2)      # is this line slow?
    total = sum(result)             # or this one?
    return total

# Run
# kernprof -l -v script.py

# Example output:
# Line #      Hits         Time  Per Hit   % Time  Line Contents
#      5     10000      15000.0      1.5     60.0      result.append(i ** 2)
#      6         1      10000.0  10000.0     40.0      total = sum(result)

Fig. 1 shows the profiling workflow.

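2. Parallel Processing

Because of CPython's global interpreter lock (GIL), only one thread executes Python bytecode at a time. CPU-bound work therefore benefits from multiprocessing, while I/O-bound work is a better fit for threading or asyncio.

2.1 multiprocessing

multiprocessing provides process-based parallelism, which sidesteps the GIL.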

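from multiprocessing import Pool, cpu_count
import time

def cpu_bound_task(n: int) -> int:
    """CPU-bound work: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

# Sequential execution
def sequential():
    results = []
    for i in range(8):
        results.append(cpu_bound_task(1_000_000))
    return results

# Parallel execution
def parallel():
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(cpu_bound_task, [1_000_000] * 8)
    return results

# Performance comparison. The __main__ guard is required on platforms that
# start workers with "spawn" (Windows, macOS), where each worker re-imports
# this module.
if __name__ == "__main__":
    start = time.perf_counter()
    sequential()
    print(f"Sequential: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    parallel()
    print(f"Parallel: {time.perf_counter() - start:.2f}s")

# Typical results (8-core CPU; actual numbers vary by environment):
# Sequential: 4.00s
# Parallel: 0.60s (about 6.7x faster)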

2.2 concurrent.futures

concurrent.futures is a high-level API for parallel execution. It provides ProcessPoolExecutor and ThreadPoolExecutor.

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
import requests  # third-party: pip install requests

# CPU-bound: ProcessPoolExecutor
def process_data(data: list) -> list:
    return [x ** 2 for x in data]

# I/O-bound: ThreadPoolExecutor
def fetch_url(url: str) -> str:
    response = requests.get(url, timeout=10)
    return response.text

if __name__ == "__main__":
    # Example input (placeholder data): four chunks of 1,000 integers each
    data_chunks = [list(range(i, i + 1000)) for i in range(0, 4000, 1000)]

    with ProcessPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(process_data, chunk) for chunk in data_chunks]
        # as_completed yields futures in completion order, not submission order
        results = [f.result() for f in as_completed(futures)]

    urls = ['https://example.com'] * 100

    with ThreadPoolExecutor(max_workers=20) as executor:
        futures = {executor.submit(fetch_url, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                data = future.result()
                print(f"Fetched {url}: {len(data)} bytes")
            except Exception as e:
                print(f"Error fetching {url}: {e}")
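
One detail worth seeing in isolation is result ordering: as_completed() yields futures as they finish, while Executor.map() returns results in input order. The sketch below uses time.sleep as a stand-in for real I/O; `simulated_io` and the delay values are made up for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def simulated_io(delay):
    """Stand-in for an I/O call: sleep, then return the delay."""
    time.sleep(delay)
    return delay

delays = [0.3, 0.1, 0.2]

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in input order, regardless of completion order.
    in_input_order = list(executor.map(simulated_io, delays))

    # as_completed() yields futures as they finish.
    futures = [executor.submit(simulated_io, d) for d in delays]
    in_completion_order = [f.result() for f in as_completed(futures)]

print(in_input_order)
print(in_completion_order)
```

Use map() when results must line up with their inputs, and as_completed() when each result should be handled as soon as it is ready.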

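Table 2. Criteria for choosing a parallelization approach

Workload	Recommended approach	Reason
CPU-bound	multiprocessing	Avoids the GIL; true parallel execution
I/O-bound (synchronous)	threading	Lightweight; other threads run while one waits on I/O
I/O-bound (asynchronous)	asyncio	Efficient; handles many concurrent connections
Mixed	ProcessPool + asyncio	Separates CPU work from I/O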
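
The asyncio row of Table 2 can be sketched as follows. This is a minimal illustration, not a production client: `fake_request` replaces a real network call with asyncio.sleep, and the request count and concurrency limit are arbitrary assumptions.

```python
import asyncio
import time

async def fake_request(request_id, delay=0.1):
    """Stand-in for a network call; asyncio.sleep yields to the event loop."""
    await asyncio.sleep(delay)
    return request_id

async def fetch_all(count, limit=10):
    """Run `count` fake requests with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(request_id):
        async with semaphore:
            return await fake_request(request_id)

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(bounded(i) for i in range(count)))

start = time.perf_counter()
results = asyncio.run(fetch_all(50))
elapsed = time.perf_counter() - start
print(f"{len(results)} requests in {elapsed:.2f}s")
```

All 50 waits overlap on a single thread, so the total time is far below the 5 seconds that running them one after another would take. The semaphore caps concurrency so a real server would not be flooded.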

3. Cython/Numba

Both tools speed up Python code by compiling it.

3.1 Numba

Numba is a JIT compiler. For numerical code, adding a decorator is often enough to get a substantial speedup [3].

from numba import jit, njit, prange
import numpy as np

# Basic JIT compilation
@jit(nopython=True)
def sum_squares_numba(n: int) -> int:
    total = 0
    for i in range(n):
        total += i ** 2
    return total

# Parallelization
@njit(parallel=True)
def parallel_sum(arr: np.ndarray) -> float:
    total = 0.0
    for i in prange(len(arr)):  # prange parallelizes the loop
        total += arr[i] ** 2
    return total

# Performance comparison
import time

# Note: Numba uses fixed-width 64-bit integers, so for this n the sum
# (about 3.3e20) wraps around, while pure Python returns the exact value.
# The comparison below is about timing only.
n = 10_000_000

# Pure Python
def sum_squares_python(n):
    return sum(i ** 2 for i in range(n))

start = time.perf_counter()
sum_squares_python(n)
print(f"Python: {time.perf_counter() - start:.4f}s")

# Numba (the first call includes compilation time)
sum_squares_numba(10)  # warm-up
start = time.perf_counter()
sum_squares_numba(n)
print(f"Numba: {time.perf_counter() - start:.4f}s")

# Typical results (environment-dependent):
# Python: 1.2000s
# Numba: 0.0150s (about 80x faster)

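3.1.1 Limitations of Numba

Numba's nopython mode supports only a subset of Python.

# Well supported: NumPy array operations, basic Python syntax
# Limited or unsupported (varies by Numba version): some list comprehension
# forms, dictionaries, classes (in part)

@njit
def supported_operations(arr: np.ndarray) -> np.ndarray:
    result = np.empty_like(arr)
    for i in range(len(arr)):
        result[i] = np.sin(arr[i]) + np.cos(arr[i])
    return result

# Example that may fail to compile
# @njit
# def unsupported():
#     return {i: i**2 for i in range(10)}  # dict comprehension; support depends on the Numba version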

3.2 Cython

Cython compiles Python into C extension modules and allows finer-grained control than Numba.

# sum_squares.pyx
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def sum_squares_cython(int n):
    cdef long long total = 0
    # i is long long so that i * i is computed in 64 bits;
    # with a 32-bit int the product overflows once i exceeds 46340
    cdef long long i
    for i in range(n):
        total += i * i
    return total

# Typed array (memoryview) processing
@cython.boundscheck(False)
@cython.wraparound(False)
def process_array(double[:] arr):
    cdef Py_ssize_t i
    cdef Py_ssize_t n = arr.shape[0]
    cdef double total = 0.0
    for i in range(n):
        total += arr[i] * arr[i]
    return sqrt(total)

# setup.py
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("sum_squares.pyx"),
    include_dirs=[np.get_include()],
)

# Build
# python setup.py build_ext --inplace

Table 3. Numba vs Cython

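Aspect	Numba	Cython
Ease of use	Easy (decorator only)	Moderate (type declarations needed)
Compilation	JIT (at run time)	AOT (ahead of time)
C interop	Limited	Strong
GPU support	Yes (CUDA)	No
Scope	Mainly numerical code	General purpose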

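4. GPU Optimization

GPUs can accelerate large, highly parallel computations.

4.1 CuPy

CuPy is a NumPy-compatible GPU array library [4]. Much NumPy code can be moved to the GPU with few changes.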

import cupy as cp
import numpy as np
import time

# Same API as NumPy
n = 10_000_000

# CPU (NumPy)
a_np = np.random.rand(n).astype(np.float32)
b_np = np.random.rand(n).astype(np.float32)

start = time.perf_counter()
c_np = a_np + b_np
c_np = np.sin(c_np)
c_np = np.sum(c_np)
print(f"NumPy: {time.perf_counter() - start:.4f}s")

# GPU (CuPy)
a_cp = cp.asarray(a_np)  # transfer to the GPU (not included in the timing)
b_cp = cp.asarray(b_np)

# Warm-up: the first call compiles and caches the kernels
cp.sum(cp.sin(a_cp + b_cp))
cp.cuda.Stream.null.synchronize()

start = time.perf_counter()
c_cp = a_cp + b_cp
c_cp = cp.sin(c_cp)
c_cp = cp.sum(c_cp)
cp.cuda.Stream.null.synchronize()  # wait for the GPU to finish
print(f"CuPy: {time.perf_counter() - start:.4f}s")

# Copy the result back to the CPU
result = cp.asnumpy(c_cp)

# Typical results (environment-dependent):
# NumPy: 0.1500s
# CuPy: 0.0050s (about 30x faster)

4.2 Using the GPU with PyTorch

PyTorch's GPU support is also useful for numerical computing outside machine learning.

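import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

def sync():
    """Wait for queued GPU work to finish; does nothing on CPU."""
    if device.type == 'cuda':
        torch.cuda.synchronize()

# Create tensors (directly on the selected device)
a = torch.randn(10000, 10000, device=device)
b = torch.randn(10000, 10000, device=device)

# Matrix multiplication (runs on the GPU when CUDA is available;
# without CUDA this block also measures the CPU)
sync()  # make sure tensor creation has finished before timing
start = time.perf_counter()
c = torch.matmul(a, b)
sync()  # CUDA kernels run asynchronously
print(f"GPU matmul: {time.perf_counter() - start:.4f}s")

# Comparison with the CPU
a_cpu = a.cpu()
b_cpu = b.cpu()

start = time.perf_counter()
c_cpu = torch.matmul(a_cpu, b_cpu)
print(f"CPU matmul: {time.perf_counter() - start:.4f}s")

# Typical results (RTX 4090 vs Ryzen 9; environment-dependent):
# GPU matmul: 0.05s
# CPU matmul: 2.50s (the GPU is about 50x faster)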

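Fig. 2 shows a flow for choosing an optimization technique.

5. Memory Optimization

Improving memory efficiency matters for large-scale data processing and for memory-constrained environments.

5.1 Memory profiling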

# pip install memory_profiler

from memory_profiler import profile

@profile
def memory_heavy():
    # Build a list (high memory use)
    big_list = [i ** 2 for i in range(1_000_000)]
    return sum(big_list)

# Run
# python -m memory_profiler script.py

# Example output:
# Line #    Mem usage    Increment  Line Contents
#      4     50.0 MiB     50.0 MiB   big_list = [i ** 2 for i in range(1_000_000)]
#      5     50.0 MiB      0.0 MiB   return sum(big_list)
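
When installing a third-party package is not an option, the standard library's tracemalloc module gives a comparable view of Python-level allocations. This is a minimal sketch offered as an alternative, not something the chapter's workflow depends on; `build_list` and the list size are assumptions made for the example.

```python
import tracemalloc

def build_list(n):
    """Allocate a list of n squared integers."""
    return [i ** 2 for i in range(n)]

tracemalloc.start()
data = build_list(200_000)
current, peak = tracemalloc.get_traced_memory()  # bytes allocated now / at peak
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")

# Show the source lines responsible for the largest allocations.
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```

tracemalloc only tracks memory allocated through Python's allocator, so memory held by native libraries may not appear in these figures.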

5.2 Using generators

Generators reduce memory use by producing values one at a time instead of holding them all in memory.

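# Memory-inefficient: list
def get_squares_list(n: int) -> list:
    return [i ** 2 for i in range(n)]  # keeps every element in memory

# Memory-efficient: generator
def get_squares_gen(n: int):
    for i in range(n):
        yield i ** 2  # produces one element at a time

# Usage
import sys

# List: high memory use
# (sys.getsizeof counts only the list's own pointer array, not the integer
# objects it references, so the real footprint is larger)
squares_list = get_squares_list(1_000_000)
print(f"List size: {sys.getsizeof(squares_list) / 1e6:.1f} MB")

# Generator: minimal memory use
squares_gen = get_squares_gen(1_000_000)
print(f"Generator size: {sys.getsizeof(squares_gen)} bytes")

# Processing works the same way
total = sum(squares_gen)  # values are generated during iteration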
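
Generators also compose into pipelines in which no stage materializes the full data set. The sketch below chains generator expressions and processes the output in fixed-size batches; `read_values` stands in for a real data source, and `batched` is a small hypothetical helper written here (Python 3.12 and later also ship a similar itertools.batched).

```python
import itertools

def read_values(n):
    """Stand-in for a large data source, yielding one value at a time."""
    for i in range(n):
        yield i

def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch

# Each stage is lazy: nothing runs until the final loop pulls values through.
evens = (v for v in read_values(1_000_000) if v % 2 == 0)
squares = (v * v for v in evens)

total = 0
batch_count = 0
for batch in batched(squares, 10_000):
    total += sum(batch)
    batch_count += 1

print(total, batch_count)
```

At any moment only one batch of 10,000 values is held in memory, regardless of how large the source is.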

5.3 Using __slots__

Declaring __slots__ reduces the per-instance memory footprint of a class.

import sys

# Regular class (each instance has a __dict__)
class PointRegular:
    def __init__(self, x: float, y: float):
        self.x = x
        self.y = y

# With __slots__ (instances have no __dict__)
class PointSlots:
    __slots__ = ['x', 'y']

    def __init__(self, x: float, y: float):
        self.x = x
        self.y = y

# Memory comparison
regular = PointRegular(1.0, 2.0)
slots = PointSlots(1.0, 2.0)

print(f"Regular: {sys.getsizeof(regular) + sys.getsizeof(regular.__dict__)} bytes")
print(f"Slots: {sys.getsizeof(slots)} bytes")

# Typical results (vary by Python version):
# Regular: 152 bytes
# Slots: 48 bytes (68% smaller)

# Effect when creating many objects
points_regular = [PointRegular(i, i) for i in range(100_000)]
points_slots = [PointSlots(i, i) for i in range(100_000)]
# By the per-object figures above, the slots version saves roughly 10 MB
# (100,000 x 104 bytes)

5.4 Optimizing data types

Choosing an appropriate data type reduces memory use.

import numpy as np

# float64 (default)
arr_f64 = np.random.rand(1_000_000)
print(f"float64: {arr_f64.nbytes / 1e6:.1f} MB")  # 8.0 MB

# float32
arr_f32 = arr_f64.astype(np.float32)
print(f"float32: {arr_f32.nbytes / 1e6:.1f} MB")  # 4.0 MB

# float16 (only when precision requirements are low)
arr_f16 = arr_f64.astype(np.float16)
print(f"float16: {arr_f16.nbytes / 1e6:.1f} MB")  # 2.0 MB

# The same applies to integers
int_arr = np.arange(1_000_000, dtype=np.int64)  # 8 MB
int_arr_32 = int_arr.astype(np.int32)           # 4 MB
# 2 MB, but safe only if every value fits in int16 (-32768 to 32767);
# this array goes up to 999,999, so the values here silently wrap around
int_arr_16 = int_arr.astype(np.int16)
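
Because astype() does not check value ranges, it is safer to verify the range before narrowing an integer array. The sketch below picks the smallest signed integer type that holds the data; `downcast_int` is a hypothetical helper and assumes a non-empty integer array.

```python
import numpy as np

def downcast_int(arr):
    """Return arr converted to the smallest signed int dtype that fits it."""
    lo, hi = int(arr.min()), int(arr.max())
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return arr.astype(dtype)
    raise OverflowError("values do not fit in a signed 64-bit integer")

small = downcast_int(np.arange(100, dtype=np.int64))
large = downcast_int(np.arange(1_000_000, dtype=np.int64))

print(small.dtype, small.nbytes)
print(large.dtype, large.nbytes)
```

The range check costs one pass over the data, which is usually negligible next to the risk of silently corrupted values.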

References

[1] Python Documentation, "The Python Profilers," docs.python.org, 2024.

[2] Ben Frederickson, "py-spy: Sampling profiler for Python programs," github.com/benfred/py-spy, 2024.

[3] Numba, "Numba Documentation," numba.pydata.org, 2024.

[4] CuPy, "CuPy Documentation," docs.cupy.dev, 2024.

Disclaimer
This content is based on information available as of December 2025. Performance figures vary by environment. Base optimization decisions on your own measurements.
