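Cross-Validation and Grid Search Explained

Introduction

Imagine two different ways of taking exams:

The traditional way: practice on one set of questions and be graded on a single exam, so the grade depends heavily on which questions happen to appear (and reusing the practice questions only rewards rote memorization)
Cross-validation: rotate through several sets of questions for practice and testing, which gives a more realistic picture of ability

Now imagine two different ways of tuning parameters:

Manual tuning: try values one at a time, which is slow, laborious, and likely to miss the best setting
Grid search: systematically try every combination in a predefined grid and automatically pick the best one

Cross-validation and grid search are two essential techniques in machine learning. They help us:

Estimate model performance more reliably
Find good hyperparameters more efficiently
Detect overfitting and select models that generalize better

This article uses analogies, the underlying math, and complete code examples to explain both concepts in depth.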

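Part 1: Cross-Validation

What is cross-validation?

**Cross-validation** is a method for evaluating model performance. It splits the dataset into several subsets and rotates which subset serves as the test set while the rest form the training set, in order to estimate how well the model generalizes.

An analogy:

It is like testing a student with several exam papers instead of just one
Each part of the data takes a turn as the test set while the model trains on the rest
The average over several "exams" is more reliable than a single grade

Why do we need cross-validation?

1. Limitations of the traditional approach

Traditional approach: a single split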

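# Traditional approach: a single train/test split
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Example data (iris, the dataset used in the later examples)
X, y = load_iris(return_X_y=True)

# Split the dataset (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")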

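Problems:

The model is evaluated only once, so the result can be unstable
If the test set happens to be "easy" or "hard", the estimate is misleading
The data is not fully used: test samples never contribute to training, and training samples never contribute to evaluation

An analogy:

It is like sitting a single exam: the grade may be off simply because the questions happened to be easy or hard
It is like studying from only part of the material and leaving the rest unused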
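
The instability of a single split is easy to observe. The sketch below is an illustration, not part of the original example: it assumes the iris dataset and an arbitrary choice of ten random seeds, repeats the same 70/30 split, and records how much the test accuracy moves.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Repeat the same 70/30 hold-out split with different random seeds.
split_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

# The gap between min and max is caused by the split alone, not by the model.
print(f"min = {min(split_scores):.4f}, max = {max(split_scores):.4f}")
```

Any gap between the minimum and maximum comes purely from which samples landed in the test set, which is the noise that averaging over folds is meant to reduce.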

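2. Advantages of cross-validation

Cross-validation: repeated evaluation

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Evaluate the model with cross-validation
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation

print(f"Score per fold: {scores}")
print(f"Mean score: {scores.mean():.4f}")
print(f"Standard deviation: {scores.std():.4f}")

Advantages:

Several evaluations instead of one, so the estimate is more stable
Every sample is used for both training and testing (in different folds)
The mean and spread across folds give a more trustworthy performance estimate

An analogy:

It is like averaging several exams: the assessment is more accurate
Every sample takes part in both training and testing, so nothing is wasted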

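K-Fold Cross-Validation

How it works

K-fold cross-validation is the most widely used form of cross-validation:

Split the dataset into K folds of (approximately) equal size
Use K-1 folds as the training set and the remaining fold as the test set
Repeat K times, each time with a different fold as the test set
Average the K evaluation results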

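In mathematical terms:

Given a dataset $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, split it into K disjoint folds:

$$
D = D_1 \cup D_2 \cup \dots \cup D_K, \qquad D_i \cap D_j = \emptyset \ (i \neq j)
$$

For the $k$-th iteration:

Training set: $D_{train}^{(k)} = D \setminus D_k$
Test set: $D_{test}^{(k)} = D_k$

Evaluation metric (e.g. accuracy):

$$
Score_k = \text{Metric}(Model(D_{train}^{(k)}), D_{test}^{(k)})
$$

Final score:

$$
Score_{CV} = \frac{1}{K} \sum_{k=1}^{K} Score_k
$$

An analogy:

$K$: the number of exam papers
$D_k$: the $k$-th exam paper
$Score_k$: the grade on the $k$-th exam
$Score_{CV}$: the average grade over all exams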
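
The formulas map almost one-to-one onto code. The following is a minimal from-scratch sketch using only NumPy for the splitting; the dataset (iris), the model, and the variable names `folds`, `fold_scores`, and `score_cv` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
K = 5

# Shuffle the indices once, then cut them into K nearly equal folds D_1..D_K.
rng = np.random.default_rng(0)
indices = rng.permutation(len(X))
folds = np.array_split(indices, K)

fold_scores = []
for k in range(K):
    test_idx = folds[k]                                     # D_k
    train_idx = np.concatenate(folds[:k] + folds[k + 1:])   # D \ D_k
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # Score_k

score_cv = np.mean(fold_scores)  # Score_CV: the mean of the K fold scores
print(f"Score_CV = {score_cv:.4f}")
```

In practice `KFold` or `cross_val_score` does this bookkeeping; writing it out once mainly makes the notation concrete.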

Visualizing the folds

Dataset: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

5-fold cross-validation (each fold holds 10 / 5 = 2 samples):

Fold 1: train=[3,4,5,6,7,8,9,10] test=[1,2]
Fold 2: train=[1,2,5,6,7,8,9,10] test=[3,4]
Fold 3: train=[1,2,3,4,7,8,9,10] test=[5,6]
Fold 4: train=[1,2,3,4,5,6,9,10] test=[7,8]
Fold 5: train=[1,2,3,4,5,6,7,8] test=[9,10]
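
A split like this can be generated with scikit-learn's `KFold`. A short sketch, assuming the ten samples are simply the integers 1 to 10 and that shuffling is left off so the folds stay contiguous; with 10 samples and 5 folds, each test fold holds 2 samples:

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(1, 11)      # the samples 1..10
kfold = KFold(n_splits=5)    # no shuffling: folds are contiguous blocks

test_folds = []
for i, (train_idx, test_idx) in enumerate(kfold.split(data), start=1):
    test_folds.append(data[test_idx].tolist())
    print(f"Fold {i}: train={data[train_idx].tolist()} test={data[test_idx].tolist()}")
```

Every sample appears in exactly one test fold, and the five test folds together cover the whole dataset.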

Code example

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Create the model
model = LogisticRegression(max_iter=1000)

# Method 1: cross_val_score (simplest)
# For a classifier, an integer cv uses stratified folds without shuffling
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("Method 1 - cross_val_score:")
print(f"Score per fold: {scores}")
print(f"Mean score: {scores.mean():.4f} ± {scores.std():.4f}")

# Method 2: a manual K-fold loop (more flexible)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []

for train_idx, test_idx in kfold.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Evaluate the model
    y_pred = model.predict(X_test)
    score = accuracy_score(y_test, y_pred)
    cv_scores.append(score)
    print(f"Fold {len(cv_scores)}: accuracy = {score:.4f}")

print(f"\nMean accuracy: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")

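Other cross-validation methods

1. Leave-One-Out Cross-Validation (LOOCV)

**How it works:** Each iteration holds out a single sample as the test set and trains on all the others.

Characteristics:

K = number of samples
Computationally expensive (one model fit per sample)
Suited to small datasets

from sklearn.model_selection import LeaveOneOut, cross_val_score

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
print(f"LOOCV mean score: {scores.mean():.4f}")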

2. Stratified K-Fold

**How it works:** Each fold keeps the same class proportions as the original dataset.

**When to use it:** Classification problems, especially with imbalanced classes.

from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skfold, scoring='accuracy')
print(f"Stratified K-fold score: {scores.mean():.4f} ± {scores.std():.4f}")
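
To see why stratification matters, compare how the positive class is distributed across test folds. The labels below are a made-up, deliberately sorted and imbalanced example (90 negatives followed by 10 positives); the helper name `positives_per_fold` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

def positives_per_fold(splitter):
    """Count the positive samples that land in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain_counts = positives_per_fold(KFold(n_splits=5))
stratified_counts = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_counts)       # [0, 0, 0, 0, 10]
print("StratifiedKFold:", stratified_counts)  # [2, 2, 2, 2, 2]
```

With plain unshuffled `KFold`, four of the five test folds contain no positive sample at all, so metrics such as recall cannot even be computed on them; the stratified splitter keeps the 10% positive rate in every fold.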

3. Time Series Split

**How it works:** The data is split in time order: earlier observations are used for training, later ones for testing.

**When to use it:** Time series data.

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    print(f"Training set size: {len(train_idx)}, test set size: {len(test_idx)}")
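
Printing the indices rather than only the sizes makes the "train on the past, test on the future" property visible. A small sketch, assuming 12 observations already sorted by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_time = np.arange(12).reshape(-1, 1)  # 12 observations in time order
tscv = TimeSeriesSplit(n_splits=3)

splits = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in tscv.split(X_time)
]
for train_idx, test_idx in splits:
    print(f"train={train_idx} test={test_idx}")
```

The training window grows with each split, and every training index is smaller than every test index, so the model never sees data from the future of its test window.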

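Cross-validation pitfalls

1. Data leakage

**Problem:** Fitting a preprocessing step (such as standardization) on the full dataset before splitting leaks information from the test set into training.

Incorrect:

# ❌ Wrong: standardize first, then split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # uses statistics of all the data, test rows included
X_train, X_test = train_test_split(X_scaled, test_size=0.3)

Correct:

# ✅ Right: split first, then standardize
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set only
X_test_scaled = scaler.transform(X_test)  # transform the test set with the training-set parameters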

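2. Preprocessing inside cross-validation

Within cross-validation, preprocessing must be fitted separately on the training portion of each fold:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# A Pipeline bundles preprocessing with the model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Cross-validation refits the whole pipeline, scaler included, on each training fold
scores = cross_val_score(pipeline, X, y, cv=5)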
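
One way to check that the scaler really is refitted per fold is to keep the fitted pipelines and inspect their statistics. This sketch assumes the iris dataset and uses `cross_validate` with `return_estimator=True`; the step names match the pipeline above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# return_estimator=True keeps the pipeline fitted on each training fold.
result = cross_validate(pipeline, X, y, cv=5, return_estimator=True)

# Each fold's pipeline carries scaler statistics computed from its own training data.
fold_means = [est.named_steps["scaler"].mean_ for est in result["estimator"]]
for i, mean in enumerate(fold_means, start=1):
    print(f"Fold {i} scaler mean: {np.round(mean, 3)}")
```

The per-fold means differ slightly from one another because each one is computed without that fold's test rows, which is exactly what prevents leakage.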

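Part 2: Grid Search

What is grid search?

**Grid search** is a hyperparameter optimization method: it systematically evaluates every combination in a predefined grid of candidate values and keeps the one that gives the best model performance.

An analogy:

It is like systematically trying every combination of seasonings to find the best recipe
In the same way, every parameter combination is tried to find the best model
The candidates form a "grid" over the parameter space, and the search visits every point on it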

Why do we need grid search?

1. Problems with manual tuning

Manual tuning example:

# Try different parameters by hand
best_score = 0
best_params = None

for C in [0.1, 1, 10]:
    for penalty in ['l1', 'l2']:
        model = LogisticRegression(C=C, penalty=penalty, solver='liblinear')
        score = cross_val_score(model, X, y, cv=5).mean()
        
        if score > best_score:
            best_score = score
            best_params = {'C': C, 'penalty': penalty}
        
        print(f"C={C}, penalty={penalty}, score={score:.4f}")

print(f"\nBest parameters: {best_params}")
print(f"Best score: {best_score:.4f}")

Problems:

The loops have to be written by hand
It is easy to miss some combinations
There is no built-in parallelism
The code is verbose

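2. Advantages of grid search

Grid search example:

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
# lbfgs supports only the l2 penalty, so l1 is paired with liblinear alone
param_grid = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10, 100]},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.1, 1, 10, 100]},
]

# Create the model
model = LogisticRegression(max_iter=1000)

# Grid search
grid_search = GridSearchCV(
    model, 
    param_grid, 
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1  # run in parallel
)

# Run the search
grid_search.fit(X, y)

# Inspect the results
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")

Advantages:

Every combination is visited automatically
Each combination is evaluated with cross-validation
Parallel execution is supported
The code is concise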

How grid search works

The parameter space

Defining the parameter grid:

param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2']
}

The resulting combinations:

(C=0.1, penalty='l1')
(C=0.1, penalty='l2')
(C=1, penalty='l1')
(C=1, penalty='l2')
(C=10, penalty='l1')
(C=10, penalty='l2')

That is 3 × 2 = 6 combinations in total.
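
scikit-learn exposes this enumeration directly through `ParameterGrid`, which is handy for checking the size of a grid before launching a search. A short sketch using the same two parameters:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"C": [0.1, 1, 10], "penalty": ["l1", "l2"]}

# ParameterGrid expands the dict into the Cartesian product of its value lists.
combos = list(ParameterGrid(param_grid))
for combo in combos:
    print(combo)

print(f"Total combinations: {len(combos)}")
```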

In mathematical terms:

Given a parameter grid:
$$
\Theta = \{\theta_1, \theta_2, \dots, \theta_n\}
$$

where each $\theta_i$ is one parameter combination:
$$
\theta_i = (p_1^{(i)}, p_2^{(i)}, \dots, p_m^{(i)})
$$

grid search looks for:
$$
\theta^* = \arg\max_{\theta \in \Theta} Score_{CV}(\theta)
$$

where $Score_{CV}(\theta)$ is the cross-validation score obtained with parameters $\theta$.

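The search procedure

For each parameter combination θ:
    1. Evaluate the model with cross-validation
    2. Record the score
    
Pick the combination with the highest score
Return the best parameters and the best score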
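
The procedure can be written out by hand in a few lines. The sketch below is not how `GridSearchCV` is implemented internally; it is a minimal re-creation under the assumption of an SVC on the iris dataset with a small illustrative grid, followed by a comparison with `GridSearchCV` on the same grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}

# Manual grid search: score every combination with 5-fold CV, keep the best.
best_score, best_params = -1.0, None
for params in ParameterGrid(param_grid):
    score = cross_val_score(SVC(**params), X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, params

# The same search with GridSearchCV for comparison.
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

print(f"manual:       {best_params} -> {best_score:.4f}")
print(f"GridSearchCV: {search.best_params_} -> {search.best_score_:.4f}")
```

Both searches use the same folds here (an integer `cv` gives unshuffled stratified folds for classifiers), so the best scores agree.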

GridSearchCV in detail

Basic usage

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Define the parameter grid
# (gamma is ignored by the linear kernel, so those combinations repeat the same model)
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}

# Create the model
svm = SVC()

# Grid search
grid_search = GridSearchCV(
    estimator=svm,
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',  # evaluation metric
    n_jobs=-1,  # use all CPU cores
    verbose=1  # show progress
)

# Run the search
grid_search.fit(X, y)

# Inspect the results
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)
print("Best model:", grid_search.best_estimator_)

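Key parameters

The main parameters of GridSearchCV:

estimator: the model to tune
param_grid: the parameter grid (a dict or a list of dicts)
cv: the number of folds, or a cross-validation splitter object
scoring: the evaluation metric ('accuracy', 'f1', 'roc_auc', etc.)
n_jobs: the number of parallel jobs (-1 uses all cores)
verbose: verbosity level (higher values print more progress messages)
refit: whether to retrain the best configuration on the full dataset passed to fit (default True)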
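
The "list of dicts" form of `param_grid` is useful when some parameters only make sense together. As an illustration (the grid values are assumptions), `gamma` has no effect on a linear SVM kernel, so it can be restricted to the RBF sub-grid, which shrinks the search:

```python
from sklearn.model_selection import ParameterGrid

# A single dict forms the full Cartesian product, including redundant
# (kernel='linear', gamma=...) combinations.
full_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["rbf", "linear"],
}

# A list of dicts searches each sub-grid separately.
split_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(f"full grid: {n_full} candidates, split grid: {n_split} candidates")
```

Either object can be passed as `param_grid` to `GridSearchCV`; the split version covers the same distinct models with fewer fits.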

Inspecting detailed results

import pandas as pd

# Convert the results to a DataFrame
results_df = pd.DataFrame(grid_search.cv_results_)

# Show the most useful columns
print(results_df[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']].head(10))

# Show everything recorded for the best combination
best_idx = grid_search.best_index_
print("\nDetails of the best parameter combination:")
print(results_df.iloc[best_idx])

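Random Search

Random search is an alternative to grid search: it samples parameter combinations at random instead of enumerating all of them.

When to use it:

The parameter space is large
Compute resources are limited
Some parameters have little effect on performance

from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from scipy.stats import loguniform

# Define the parameter distributions
# Note: scipy's uniform(loc, scale) samples from [loc, loc + scale], so
# uniform(0.1, 100) would cover [0.1, 100.1]; a log-uniform distribution
# is the usual choice for scale parameters such as C and gamma
param_distributions = {
    'C': loguniform(0.1, 100),      # log-uniform over [0.1, 100]
    'gamma': loguniform(0.001, 1),  # log-uniform over [0.001, 1]
    'kernel': ['rbf', 'linear']
}

# Random search
random_search = RandomizedSearchCV(
    SVC(),
    param_distributions,
    n_iter=50,  # sample 50 random combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X, y)

print("Best parameters:", random_search.best_params_)
print("Best score:", random_search.best_score_)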

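Combining grid search with cross-validation

What GridSearchCV does internally:

For each parameter combination:
    Evaluate with cross-validation:
        Fold 1: train → test → score 1
        Fold 2: train → test → score 2
        ...
        Fold K: train → test → score K
    
    Mean score = (score 1 + score 2 + ... + score K) / K

Pick the combination with the highest mean score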
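
This flow can be verified from the `cv_results_` attribute, which stores the score of every fold for every candidate. A small sketch, assuming an SVC on iris with three values of `C`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5).fit(X, y)

results = search.cv_results_

# Rows: folds (split0..split4); columns: parameter combinations.
split_scores = np.array([results[f"split{k}_test_score"] for k in range(5)])

# Averaging over folds reproduces mean_test_score for every combination.
manual_mean = split_scores.mean(axis=0)
print("per-fold scores:\n", split_scores)
print("manual mean:     ", manual_mean)
print("mean_test_score: ", results["mean_test_score"])
```

`best_score_` is simply the largest entry of `mean_test_score`, and `best_params_` is the combination it belongs to.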

Complete example:

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Build a pipeline (preprocessing plus model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

# Define the parameter grid (note the "<step name>__" prefix on each parameter)
param_grid = {
    'scaler__with_mean': [True, False],
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 7, None],
    'classifier__min_samples_split': [2, 5, 10]
}

# Use stratified K-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Grid search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=cv,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

Part 3: Practical Examples

Example 1: Classification - the iris dataset

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

print("Dataset shape:", X.shape)
print("Number of classes:", len(np.unique(y)))

# Case 1: SVM grid search
print("\n=== SVM grid search ===")
svm_param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}

svm = SVC()
svm_grid = GridSearchCV(
    svm, svm_param_grid, cv=5, 
    scoring='accuracy', n_jobs=-1
)
svm_grid.fit(X, y)

print(f"Best parameters: {svm_grid.best_params_}")
print(f"Best score: {svm_grid.best_score_:.4f}")

# Case 2: random forest grid search
print("\n=== Random forest grid search ===")
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}

rf = RandomForestClassifier(random_state=42)
rf_grid = GridSearchCV(
    rf, rf_param_grid, cv=5,
    scoring='accuracy', n_jobs=-1
)
rf_grid.fit(X, y)

print(f"Best parameters: {rf_grid.best_params_}")
print(f"Best score: {rf_grid.best_score_:.4f}")

# Case 3: grid search over a Pipeline
print("\n=== Grid search with a Pipeline ===")
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

pipeline_param_grid = {
    'scaler__with_mean': [True, False],
    'svm__C': [0.1, 1, 10],
    'svm__gamma': [0.01, 0.1, 1],
    'svm__kernel': ['rbf', 'linear']
}

pipeline_grid = GridSearchCV(
    pipeline, pipeline_param_grid, cv=5,
    scoring='accuracy', n_jobs=-1
)
pipeline_grid.fit(X, y)

print(f"Best parameters: {pipeline_grid.best_params_}")
print(f"Best score: {pipeline_grid.best_score_:.4f}")

Example 2: Regression - California housing prices

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Load the data
housing = fetch_california_housing()
X, y = housing.data, housing.target

print("Dataset shape:", X.shape)

# Hold out a test set so the final R² is not measured on training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0]
}

# Create the model
gbr = GradientBoostingRegressor(random_state=42)

# Grid search (scikit-learn maximizes scores, so regression uses negative MSE)
grid_search = GridSearchCV(
    gbr, param_grid, cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1, verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best score (negative MSE): {grid_search.best_score_:.4f}")
print(f"Best RMSE (from mean CV MSE): {np.sqrt(-grid_search.best_score_):.4f}")

# Predict with the best model on the held-out test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(f"\nTest R² score: {r2_score(y_test, y_pred):.4f}")

Example 3: Evaluating several metrics

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Reload a classification dataset (X, y held regression data in Example 2)
X, y = load_iris(return_X_y=True)

# Define several evaluation metrics
scoring = {
    'accuracy': 'accuracy',
    'f1': 'f1_macro',
    'roc_auc': 'roc_auc_ovr'
}

# Parameter grid
# lbfgs supports only the l2 penalty, so l1 is paired with liblinear alone
param_grid = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10, 100]},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.1, 1, 10, 100]},
]

model = LogisticRegression(max_iter=1000)

# Multi-metric grid search
grid_search = GridSearchCV(
    model, param_grid, cv=5,
    scoring=scoring,
    refit='accuracy',  # select and refit the best model by accuracy
    n_jobs=-1
)

grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print("Best accuracy:", grid_search.best_score_)

# Show every metric for the selected combination
print("\nCross-validation scores for all metrics:")
for metric in scoring.keys():
    scores = grid_search.cv_results_[f'mean_test_{metric}']
    best_idx = grid_search.best_index_
    print(f"{metric}: {scores[best_idx]:.4f}")

Part 4: Best Practices and Pitfalls

1. Order of preprocessing

Correct order:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# ✅ Right: use a Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

param_grid = {
    'model__C': [0.1, 1, 10]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X, y)

Incorrect order:

# ❌ Wrong: preprocess first, then run the grid search
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # data leakage!

# Without a pipeline the parameter has no "model__" prefix
grid_search = GridSearchCV(LogisticRegression(), {'C': [0.1, 1, 10]}, cv=5)
grid_search.fit(X_scaled, y)

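2. Designing the parameter grid

Principles:

Coarse to fine: search a wide range first, then refine
Mind the cost: the number of combinations is the product of the number of values per parameter
Use a log scale: for parameters such as C that span orders of magnitude, log-spaced values are more sensible

import numpy as np

# Coarse search
param_grid_coarse = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1]
}

# Fine search around the coarse winner
# (assuming, for illustration, that the coarse search picked C=10 and gamma=0.1)
param_grid_fine = {
    'C': np.logspace(0.5, 1.5, 10),      # log scale, about 3.2 to 32
    'gamma': np.logspace(-1.5, -0.5, 10)  # log scale, about 0.032 to 0.32
}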
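
The coarse-to-fine idea can be automated by building the second grid from the first search's `best_params_`. The sketch below is one possible way to do it, assuming an RBF SVC on iris; the helper `around` and its half-decade window are hypothetical choices, not a scikit-learn feature.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Stage 1: coarse search, one candidate per decade.
coarse_grid = {"C": np.logspace(-1, 2, 4), "gamma": np.logspace(-3, 0, 4)}
coarse = GridSearchCV(SVC(kernel="rbf"), coarse_grid, cv=5).fit(X, y)

def around(value, half_width=0.5, num=5):
    """Log-spaced candidates within half_width decades of value."""
    center = np.log10(value)
    return np.logspace(center - half_width, center + half_width, num)

# Stage 2: fine search centered on the coarse winner.
fine_grid = {
    "C": around(coarse.best_params_["C"]),
    "gamma": around(coarse.best_params_["gamma"]),
}
fine = GridSearchCV(SVC(kernel="rbf"), fine_grid, cv=5).fit(X, y)

print("coarse:", coarse.best_params_, f"{coarse.best_score_:.4f}")
print("fine:  ", fine.best_params_, f"{fine.best_score_:.4f}")
```

Two stages of 16 and 25 candidates cover the neighborhood of the optimum far more densely than a single 41-point grid spread over the whole range would.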

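3. Choosing the number of folds

Rules of thumb:

Small datasets (<1000 samples): LOOCV (for very small sets) or 10-fold
Medium datasets (1000-10000 samples): 5-fold or 10-fold
Large datasets (>10000 samples): 3-fold or 5-fold

from sklearn.model_selection import LeaveOneOut

# Pick the fold count from the dataset size
n_samples = len(X)
if n_samples < 100:
    cv = LeaveOneOut()
elif n_samples < 1000:
    cv = 10
else:
    cv = 5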

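4. Managing compute resources

Parallelism:

# Use all CPU cores
grid_search = GridSearchCV(model, param_grid, n_jobs=-1)

# Limit the number of CPU cores
grid_search = GridSearchCV(model, param_grid, n_jobs=4)

# No parallelism (useful when debugging)
grid_search = GridSearchCV(model, param_grid, n_jobs=1)

Memory management:

# For large datasets, PredefinedSplit evaluates each candidate on one fixed
# validation split instead of K folds, which cuts the number of fits
from sklearn.model_selection import PredefinedSplit

# Alternatively, use a smaller parameter grid
# or switch from grid search to random search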

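5. Avoiding overfitting

**Problem:** The grid search reports a good cross-validation score, but the model does worse on truly unseen data.

**Cause:** The best hyperparameters were chosen because they scored highest on those same folds, so the reported best score is optimistically biased: the selection step itself has "seen" the evaluation data.

**Solution:** Use nested cross-validation or keep an independent test set.

import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Nested cross-validation
def nested_cv(X, y, model, param_grid, outer_cv=5, inner_cv=5):
    outer_scores = []
    
    outer_kfold = KFold(n_splits=outer_cv, shuffle=True, random_state=42)
    
    for train_idx, test_idx in outer_kfold.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        
        # Inner loop: grid search on the outer training fold only
        inner_grid = GridSearchCV(
            model, param_grid, cv=inner_cv, scoring='accuracy'
        )
        inner_grid.fit(X_train, y_train)
        
        # Outer loop: evaluate on data the search never saw
        best_model = inner_grid.best_estimator_
        score = best_model.score(X_test, y_test)
        outer_scores.append(score)
    
    return np.mean(outer_scores), np.std(outer_scores)

# Run nested cross-validation (svm_param_grid is the SVM grid from Example 1)
mean_score, std_score = nested_cv(X, y, SVC(), svm_param_grid)
print(f"Nested cross-validation score: {mean_score:.4f} ± {std_score:.4f}")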
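
Because `GridSearchCV` is itself an estimator, nested cross-validation can also be written more compactly by passing the search object to `cross_val_score`. A sketch under assumed settings (iris, a small SVC grid, 3 inner and 5 outer folds):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tunes hyperparameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

inner_search = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=inner_cv
)

# Each outer fold runs a complete grid search on its training part,
# then scores the refitted best model on its held-out part.
nested_scores = cross_val_score(inner_search, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
```

The nested score estimates how well the whole procedure (search plus refit) generalizes; it does not yield a single set of best parameters, since each outer fold may select a different one.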

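Part 5: Summary and Further Topics

Key concepts

Cross-validation

Purpose: estimate model performance more reliably
Method: split the data into folds and rotate which fold is used for testing
Advantages: uses all the data and gives a more stable estimate
Common variants: K-fold, stratified K-fold, leave-one-out

Grid search

Purpose: find the best hyperparameters
Method: systematically evaluate every combination in the grid
Advantages: automated search, combined with cross-validation
Alternatives: random search, Bayesian optimization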

The complete workflow

# 1. Imports
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 2. Load and split the data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Build the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# 4. Define the parameter grid
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 7, None],
    'classifier__min_samples_split': [2, 5, 10]
}

# 5. Set up the grid search
grid_search = GridSearchCV(
    pipeline, param_grid, cv=5,
    scoring='accuracy', n_jobs=-1, verbose=1
)

# 6. Train and search (on the training set only)
grid_search.fit(X_train, y_train)

# 7. Evaluate the best model on the held-out test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)
print("\nTest set performance:")
print(classification_report(y_test, y_pred))

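Further topics

1. Bayesian optimization

Bayesian optimization often needs fewer evaluations than grid or random search, because each new candidate is chosen using the results of the earlier ones. The example uses BayesSearchCV from the third-party scikit-optimize package (imported as skopt):

from skopt import BayesSearchCV
from skopt.space import Real, Categorical
from sklearn.svm import SVC

# Define the parameter space
param_space = {
    'C': Real(0.1, 100, prior='log-uniform'),
    'gamma': Real(0.001, 1, prior='log-uniform'),
    'kernel': Categorical(['rbf', 'linear'])
}

# Bayesian optimization
bayes_search = BayesSearchCV(
    SVC(), param_space, n_iter=50, cv=5,
    scoring='accuracy', n_jobs=-1
)

bayes_search.fit(X, y)
print("Best parameters:", bayes_search.best_params_)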

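2. Early stopping

For some algorithms (such as gradient boosting), early stopping saves time:

from sklearn.ensemble import GradientBoostingClassifier

param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Gradient boosting supports early stopping;
# n_estimators then acts as an upper bound on the number of boosting stages
gbc = GradientBoostingClassifier(
    n_iter_no_change=10,  # stop after 10 iterations without improvement
    validation_fraction=0.2  # hold out 20% of the training data for validation
)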

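3. Custom evaluation metrics

from sklearn.metrics import accuracy_score, make_scorer

# Define a custom scoring function
def custom_score(y_true, y_pred):
    # Your custom logic goes here; accuracy is used as a placeholder
    return accuracy_score(y_true, y_pred)

# Wrap it in a scorer
custom_scorer = make_scorer(custom_score)

# Use it in a grid search
grid_search = GridSearchCV(
    model, param_grid, cv=5,
    scoring=custom_scorer
)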
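
When the custom metric is a loss (lower is better), `make_scorer` needs `greater_is_better=False`, because scikit-learn's search tools always maximize the score. A minimal sketch, assuming a hypothetical `error_rate` function and the iris dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def error_rate(y_true, y_pred):
    """Fraction of misclassified samples (a loss: lower is better)."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

# greater_is_better=False makes the scorer return the negated loss,
# so that "higher is better" still holds during model selection.
error_scorer = make_scorer(error_rate, greater_is_better=False)

X, y = load_iris(return_X_y=True)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring=error_scorer
)
print("Negated error per fold:", scores)
print(f"Mean error rate: {-scores.mean():.4f}")
```

This is the same convention behind built-in names such as `neg_mean_squared_error`: the reported values are non-positive, and the sign has to be flipped to read them as errors.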

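Frequently asked questions

Q1: What is the difference between cross-validation and grid search?

A: Cross-validation is a method for evaluating model performance; grid search is a method for optimizing hyperparameters. Grid search normally uses cross-validation internally to score each parameter combination.

Q2: How many folds should I use?

A: Usually 5 or 10. Leave-one-out is an option for very small datasets, and 3 folds can be enough for very large ones.

Q3: How long does a grid search take?

A: Time ≈ number of parameter combinations × number of folds × time per fit, plus one final refit when refit=True. Parallel execution (n_jobs=-1) speeds it up.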
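
As a quick sanity check before launching a search, the number of model fits can be computed up front. The grid below is an illustrative assumption; `ParameterGrid` does the counting.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["rbf", "linear"],
}
n_folds = 5

n_candidates = len(ParameterGrid(param_grid))  # 4 * 4 * 2 combinations
n_fits = n_candidates * n_folds + 1            # +1 for the final refit (refit=True)

print(f"{n_candidates} candidates x {n_folds} folds + 1 refit = {n_fits} fits")
```

Multiplying `n_fits` by the time of a single fit gives a rough serial estimate; adding one more parameter with five values would multiply the total by five.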

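Q4: How do I choose the ranges for the parameter grid?

A: Go from coarse to fine: search a wide range first to find the right neighborhood, then search more finely within it. For parameters such as regularization strength, a log scale is more sensible.

Q5: Can grid search overfit?

A: It can. If the grid search is run on the entire dataset and the model is then judged on that same data, the reported score is optimistic. Use nested cross-validation or keep an independent test set.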

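Conclusion

Cross-validation and grid search are two essential techniques in machine learning:

Cross-validation gives a more reliable estimate of model performance and reduces the influence of any one particular data split
Grid search systematically finds good hyperparameters and improves model performance

Mastering both lets you:

Evaluate models with more confidence
Tune hyperparameters more efficiently
Build better machine learning models

Hopefully this article has given you a solid understanding of both concepts and helps you apply them in real projects.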
