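Linear Regression Explained

Introduction

Imagine you want to predict a student's exam score:

Method 1: look only at study time, assuming that more study time means a higher score (simple approach)
Method 2: look at several factors, such as study time, number of review sessions, and homework completion (more complex approach)

Linear regression covers both cases: it finds a linear relationship between the inputs and the output and uses it to predict unseen results.

This article uses analogies, the underlying math, and visualizations to explain linear regression:

What is linear regression?
Simple and multiple linear regression
How do we find the best-fit line?
Where linear regression is used
Practical use in machine learning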

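Part 1: What Is Linear Regression?

An Intuitive View of Linear Regression

**Linear regression** predicts a continuous value by fitting a straight line (or, with several features, a hyperplane) to the data.

Analogies:

Using a ruler to draw the line that lies closest to all the points
Finding the "best trend line"
Summarizing the pattern in the data with a single line

Everyday Examples

Example 1: House price prediction

Input: floor area (square meters)
Output: house price (10,000 CNY)
Relationship: the larger the area, the higher the price (modeled as linear)

Example 2: Study time and exam score

Input: daily study time (hours)
Output: exam score (points)
Relationship: the longer the study time, the higher the score (modeled as linear)

Example 3: Advertising spend and sales revenue

Input: advertising spend (10,000 CNY)
Output: sales revenue (10,000 CNY)
Relationship: the higher the advertising spend, the higher the revenue (modeled as linear)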

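The Mathematical Form of Linear Regression

General form:
$$
y = w_1 x_1 + w_2 x_2 + \dots + w_m x_m + b
$$

where:

$y$: predicted value (output)
$x_1, x_2, \dots, x_m$: features (inputs)
$w_1, w_2, \dots, w_m$: weights (parameters)
$b$: bias (intercept)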
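
As a minimal sketch of the general form, the prediction for one sample is a dot product of weights and features plus the bias. The weights, bias, and feature values below are made-up illustrative numbers, not fitted values.

```python
import numpy as np

# Hypothetical weights and bias for three features (illustrative values only)
w = np.array([0.5, 10.0, -2.0])
b = 50.0

# One sample with feature values x_1, x_2, x_3
x = np.array([100.0, 3.0, 5.0])

# y = w_1*x_1 + w_2*x_2 + w_3*x_3 + b, written as a dot product
y_hat = np.dot(w, x) + b
print(y_hat)  # 0.5*100 + 10*3 - 2*5 + 50 = 120.0
```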

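Simple linear regression (the simplest case):
$$
y = wx + b
$$

where:

$w$: slope
$b$: intercept

Analogies:

$w$: how much the output changes when the input increases by one unit
$b$: the starting value, that is, the output when the input is 0
Linear relationship: every additional unit of input changes the output by the same amount

The Goal of Linear Regression

**Goal:** find the $w$ and $b$ that make the predictions as close as possible to the true values.

Mathematical form, summing over the $n$ training samples:
$$
\min_{w,b} \sum_{i=1}^{n} (y_i - (wx_i + b))^2
$$

Analogies:

Finding the line that minimizes the sum of squared vertical distances from the points to the line
Finding the "best-fit line"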
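
To make the objective concrete, here is a small sketch that evaluates the sum of squared errors for two candidate lines on a toy dataset. The helper name `sum_squared_error` and the data are illustrative assumptions, not part of any library.

```python
import numpy as np

def sum_squared_error(w, b, x, y):
    """Sum of squared vertical distances between the points and the line y = w*x + b."""
    residuals = y - (w * x + b)
    return float(np.sum(residuals ** 2))

# Toy data that lies exactly on the line y = 2x + 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 7.0, 9.0, 11.0])

loss_good = sum_squared_error(2.0, 3.0, x, y)  # the line the data came from
loss_bad = sum_squared_error(1.0, 3.0, x, y)   # slope too small

print(loss_good, loss_bad)  # 0.0 30.0
```

The line with the smaller value is the better fit; least squares looks for the pair $(w, b)$ that makes this value as small as possible.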

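Part 2: Simple Linear Regression

What Is Simple Linear Regression?

Simple linear regression is the most basic form of linear regression: one input feature and one output.

Mathematical form:
$$
y = wx + b + \epsilon
$$

where:

$x$: input feature
$y$: output value
$w$: weight (slope)
$b$: bias (intercept)
$\epsilon$: error term (the part of $y$ that the line does not explain)

Prediction function:
$$
\hat{y} = wx + b
$$

Visualization Example: Study Time and Exam Score

The following script fits a simple linear regression with scikit-learn and plots the result with matplotlib: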

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Generate example data: study time vs. exam score
np.random.seed(42)
study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Underlying relationship: score = 10 * study_hours + 60 + noise
true_scores = 10 * study_hours + 60 + np.random.normal(0, 5, len(study_hours))

# Create and fit the model (scikit-learn expects a 2-D feature matrix)
model = LinearRegression()
model.fit(study_hours.reshape(-1, 1), true_scores)

# Predict on the training inputs
predicted_scores = model.predict(study_hours.reshape(-1, 1))

# Compute evaluation metrics
r2 = r2_score(true_scores, predicted_scores)
mse = mean_squared_error(true_scores, predicted_scores)

# Create the figure
fig, ax = plt.subplots(1, 1, figsize=(12, 8))

# Plot the observed data points
ax.scatter(study_hours, true_scores, color='blue', s=100, alpha=0.6,
           label='Observed score', zorder=5, edgecolors='black', linewidth=1.5)

# Plot the fitted line
x_line = np.linspace(0, 11, 100)
y_line = model.predict(x_line.reshape(-1, 1))
ax.plot(x_line, y_line, 'r-', linewidth=3,
        label=f'Fitted line: y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}', zorder=3)

# Plot the residuals (vertical distance from each point to the line)
for i in range(len(study_hours)):
    ax.plot([study_hours[i], study_hours[i]], [true_scores[i], predicted_scores[i]],
            'g--', alpha=0.5, linewidth=1, zorder=2)

# Mark the predicted values on the line
ax.plot(study_hours, predicted_scores, 'ro', markersize=8, alpha=0.7,
        label='Predicted score', zorder=4)

# Axis labels, title, legend
ax.set_xlabel('Study time (hours)', fontsize=14, weight='bold')
ax.set_ylabel('Exam score (points)', fontsize=14, weight='bold')
ax.set_title('Simple linear regression: study time vs. exam score', fontsize=16, weight='bold')
ax.legend(loc='best', fontsize=12)
ax.grid(True, alpha=0.3)

# Add a text box with summary statistics
info_text = f'Fitted equation: y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}\n'
info_text += f'R² score: {r2:.4f}\n'
info_text += f'Mean squared error (MSE): {mse:.2f}\n'
info_text += f'Slope (w): {model.coef_[0]:.2f}\n'
info_text += f'Intercept (b): {model.intercept_:.2f}'

ax.text(0.02, 0.98, info_text, transform=ax.transAxes,
        fontsize=11, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.savefig('simple_linear_regression.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"Fitted equation: y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}")
print(f"R² score: {r2:.4f}")
print(f"Mean squared error: {mse:.2f}")

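Least Squares Method

The least squares method is the standard way to estimate the parameters of a linear regression.

Objective function (loss function):
$$
L(w, b) = \sum_{i=1}^{n} (y_i - (wx_i + b))^2
$$

Derivation:

Set the partial derivative with respect to $w$ to zero:
$$
\frac{\partial L}{\partial w} = -2\sum_{i=1}^{n} x_i(y_i - wx_i - b) = 0
$$
Set the partial derivative with respect to $b$ to zero:
$$
\frac{\partial L}{\partial b} = -2\sum_{i=1}^{n} (y_i - wx_i - b) = 0
$$
Solving the two equations together gives:
$$
w = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
$$

$$
b = \bar{y} - w\bar{x}
$$

where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$. The formula for $w$ requires that the $x_i$ are not all equal; otherwise the denominator is zero.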
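
As a sanity check on the derivation, the sketch below computes $w$ and $b$ from the closed-form formulas on synthetic data, then confirms numerically that both partial derivatives vanish at that point and that the result agrees with `np.polyfit`. The data and the random seed are arbitrary assumptions made for illustration.

```python
import numpy as np

# Synthetic data around the line y = 2x + 3 (illustrative values)
rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)
y = 2.0 * x + 3.0 + rng.normal(0.0, 1.0, size=x.size)

# Closed-form least squares solution
x_mean, y_mean = x.mean(), y.mean()
w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - w * x_mean

# Partial derivatives of the loss, evaluated at (w, b); both should be ~0
residuals = y - (w * x + b)
grad_w = -2.0 * np.sum(x * residuals)
grad_b = -2.0 * np.sum(residuals)

# Cross-check against NumPy's degree-1 polynomial fit (returns slope, then intercept)
w_ref, b_ref = np.polyfit(x, y, deg=1)

print(f"w = {w:.4f}, b = {b:.4f}")
print(f"dL/dw = {grad_w:.2e}, dL/db = {grad_b:.2e}")
```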

Visualization: How Least Squares Works

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older matplotlib)

# Generate data
np.random.seed(42)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = 2 * x + 3 + np.random.normal(0, 1, len(x))

# Grid of candidate values for w and b
w_range = np.linspace(0, 4, 50)
b_range = np.linspace(0, 6, 50)
W, B = np.meshgrid(w_range, b_range)

# Evaluate the sum of squared errors at every grid point (broadcast over the data axis)
y_pred = W[..., np.newaxis] * x + B[..., np.newaxis]  # shape (50, 50, 8)
Loss = np.sum((y - y_pred) ** 2, axis=-1)

# Locate the grid point with the smallest loss
# (an approximation of the true minimum, limited by the grid resolution)
min_idx = np.unravel_index(np.argmin(Loss), Loss.shape)
w_opt = W[min_idx]
b_opt = B[min_idx]

# Create the figure
fig = plt.figure(figsize=(16, 6))

# Left: 3D surface of the loss function
ax1 = fig.add_subplot(121, projection='3d')
surf = ax1.plot_surface(W, B, Loss, cmap='viridis', alpha=0.8,
                        linewidth=0, antialiased=True)
ax1.scatter([w_opt], [b_opt], [Loss[min_idx]], color='red',
            s=200, marker='*', label='Minimum')
ax1.set_xlabel('Weight w', fontsize=12)
ax1.set_ylabel('Bias b', fontsize=12)
ax1.set_zlabel('Loss', fontsize=12)
ax1.set_title('Loss surface (least squares)', fontsize=14, weight='bold')
ax1.legend()

# Right: contour plot of the same loss
ax2 = fig.add_subplot(122)
contour = ax2.contour(W, B, Loss, levels=20, cmap='viridis')
ax2.clabel(contour, inline=True, fontsize=8)
ax2.scatter([w_opt], [b_opt], color='red', s=200, marker='*',
            label=f'Minimum (w={w_opt:.2f}, b={b_opt:.2f})', zorder=5)
ax2.set_xlabel('Weight w', fontsize=12)
ax2.set_ylabel('Bias b', fontsize=12)
ax2.set_title('Loss contours', fontsize=14, weight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('least_squares_visualization.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"Best weight w on the grid: {w_opt:.4f}")
print(f"Best bias b on the grid: {b_opt:.4f}")
print(f"Minimum loss on the grid: {Loss[min_idx]:.4f}")

Implementation: Simple Linear Regression from Scratch

import numpy as np

class SimpleLinearRegression:
    """Simple (one-variable) linear regression fitted by least squares."""

    def __init__(self):
        self.w = None  # weight (slope)
        self.b = None  # bias (intercept)

    def fit(self, X, y):
        """
        Fit the model.

        Parameters:
            X: input feature (1-D array)
            y: target values (1-D array)
        """
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)

        # Means of the inputs and targets
        x_mean = np.mean(X)
        y_mean = np.mean(y)

        # Weight w from the closed-form solution
        numerator = np.sum((X - x_mean) * (y - y_mean))
        denominator = np.sum((X - x_mean)**2)
        if denominator == 0:
            raise ValueError("All X values are identical; the slope is undefined.")
        self.w = numerator / denominator

        # Bias b
        self.b = y_mean - self.w * x_mean

        return self

    def predict(self, X):
        """
        Predict.

        Parameters:
            X: input feature

        Returns:
            predicted values
        """
        return self.w * np.asarray(X, dtype=float) + self.b

    def score(self, X, y):
        """
        Compute the R² score.

        Parameters:
            X: input feature
            y: true values

        Returns:
            R² score
        """
        y = np.asarray(y, dtype=float)
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred)**2)
        ss_tot = np.sum((y - np.mean(y))**2)
        return 1 - (ss_res / ss_tot)

# Usage example
np.random.seed(42)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = 2 * X + 3 + np.random.normal(0, 1, len(X))

# Create and fit the model
model = SimpleLinearRegression()
model.fit(X, y)

# Predict
y_pred = model.predict(X)

# Evaluate
r2 = model.score(X, y)

print(f"Weight w: {model.w:.4f}")
print(f"Bias b: {model.b:.4f}")
print(f"R² score: {r2:.4f}")

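Part 3: Multiple Linear Regression

What Is Multiple Linear Regression?

Multiple linear regression uses several input features to predict a single output value.

Mathematical form:
$$
y = w_1 x_1 + w_2 x_2 + \dots + w_m x_m + b + \epsilon
$$

Matrix form:
$$
\mathbf{y} = \mathbf{X}\mathbf{w} + b\mathbf{1} + \boldsymbol{\epsilon}
$$

where:

$\mathbf{X}$: feature matrix ($n \times m$, with $n$ samples and $m$ features)
$\mathbf{w}$: weight vector ($m \times 1$)
$\mathbf{y}$: output vector ($n \times 1$)
$b$: bias (a scalar; $\mathbf{1}$ is the $n \times 1$ vector of ones, so $b$ is added to every sample)
$\boldsymbol{\epsilon}$: error vector ($n \times 1$)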
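
The shapes in the matrix form can be checked with a short NumPy sketch. The feature values, weights, and bias are made-up numbers chosen for illustration; the second half shows the common trick of absorbing the bias into the weight vector by prepending a column of ones to the feature matrix.

```python
import numpy as np

# Three samples (rows), two features (columns): illustrative values
X = np.array([[100.0, 3.0],
              [ 80.0, 2.0],
              [120.0, 4.0]])
w = np.array([0.5, 10.0])  # one weight per feature
b = 50.0                   # scalar bias, broadcast to every sample

# Matrix form: one prediction per row of X
y_hat = X @ w + b
print(y_hat)  # [130. 110. 150.]

# Equivalent form with the bias folded into the weights
X_aug = np.column_stack([np.ones(len(X)), X])  # shape (3, 3)
theta = np.concatenate([[b], w])               # [b, w_1, w_2]
y_hat_aug = X_aug @ theta
```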

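An Everyday Example

Example: House price prediction (several factors)

Feature 1: floor area (square meters)
Feature 2: number of rooms
Feature 3: distance to the city center (kilometers)
Feature 4: floor number
Output: house price (10,000 CNY)

Relationship:
$$
\text{price} = w_1 \times \text{area} + w_2 \times \text{rooms} + w_3 \times \text{distance} + w_4 \times \text{floor} + b
$$

Visualization Example: Regression with Several Features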

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older matplotlib)

# Generate example data: house price prediction
np.random.seed(42)
n_samples = 100

# Features: floor area and number of rooms
area = np.random.uniform(50, 200, n_samples)
rooms = np.random.uniform(1, 5, n_samples)

# Underlying relationship: price = 0.5*area + 10*rooms + 50 + noise
price = 0.5 * area + 10 * rooms + 50 + np.random.normal(0, 5, n_samples)

# Create and fit the model
X = np.column_stack([area, rooms])
model = LinearRegression()
model.fit(X, price)

# Predict and score on the training data
price_pred = model.predict(X)
r2 = r2_score(price, price_pred)

# Create the figure
fig = plt.figure(figsize=(16, 6))

# Left: 3D scatter plot with the fitted plane
ax1 = fig.add_subplot(121, projection='3d')

# Plot the data points
scatter = ax1.scatter(area, rooms, price, c=price, cmap='viridis',
                      s=50, alpha=0.6, edgecolors='black', linewidth=0.5)

# Build the fitted plane on a grid
area_range = np.linspace(area.min(), area.max(), 20)
rooms_range = np.linspace(rooms.min(), rooms.max(), 20)
Area_grid, Rooms_grid = np.meshgrid(area_range, rooms_range)
Price_grid = model.coef_[0] * Area_grid + model.coef_[1] * Rooms_grid + model.intercept_

# Draw the fitted plane
ax1.plot_surface(Area_grid, Rooms_grid, Price_grid, alpha=0.3, color='red')

ax1.set_xlabel('Floor area (m²)', fontsize=12)
ax1.set_ylabel('Number of rooms', fontsize=12)
ax1.set_zlabel('Price (10,000 CNY)', fontsize=12)
ax1.set_title('Multiple linear regression: 3D view', fontsize=14, weight='bold')
plt.colorbar(scatter, ax=ax1, label='Price')

# Right: predicted vs. true values
ax2 = fig.add_subplot(122)
ax2.scatter(price, price_pred, alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
ax2.plot([price.min(), price.max()], [price.min(), price.max()],
         'r--', linewidth=2, label='Perfect prediction')
ax2.set_xlabel('True price (10,000 CNY)', fontsize=12)
ax2.set_ylabel('Predicted price (10,000 CNY)', fontsize=12)
ax2.set_title(f'Predicted vs. true (R² = {r2:.4f})', fontsize=14, weight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Add a text box with the fitted parameters
info_text = 'Fitted equation:\n'
info_text += f'price = {model.coef_[0]:.2f}×area + {model.coef_[1]:.2f}×rooms + {model.intercept_:.2f}\n\n'
info_text += f'R² score: {r2:.4f}\n'
info_text += f'Area weight: {model.coef_[0]:.2f}\n'
info_text += f'Rooms weight: {model.coef_[1]:.2f}\n'
info_text += f'Bias: {model.intercept_:.2f}'

ax2.text(0.05, 0.95, info_text, transform=ax2.transAxes,
         fontsize=10, verticalalignment='top',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.savefig('multiple_linear_regression.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"Fitted equation: price = {model.coef_[0]:.2f}×area + {model.coef_[1]:.2f}×rooms + {model.intercept_:.2f}")
print(f"R² score: {r2:.4f}")

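Solving in Matrix Form

The closed-form least squares solution (the normal equation):

$$
\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
$$

Here $\mathbf{X}$ is assumed to include a leading column of ones, so that the first entry of $\mathbf{w}$ is the bias $b$. The formula also requires $\mathbf{X}^T\mathbf{X}$ to be invertible.

Implementation: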

import numpy as np

def multiple_linear_regression(X, y):
    """
    Multiple linear regression (normal-equation solution).

    Parameters:
        X: feature matrix (n_samples, n_features)
        y: target vector (n_samples,)

    Returns:
        weights: weight vector
        b: bias
    """
    # Prepend a bias column (all ones)
    X_with_bias = np.column_stack([np.ones(len(X)), X])

    # Solve the normal equation (X^T X) w = X^T y.
    # np.linalg.solve gives the same result as forming the inverse explicitly,
    # but is numerically more stable.
    w = np.linalg.solve(X_with_bias.T @ X_with_bias, X_with_bias.T @ y)

    # Split the result into bias and feature weights
    b = w[0]
    weights = w[1:]

    return weights, b

# Example
np.random.seed(42)
X = np.random.randn(100, 3)
y = 2 * X[:, 0] + 3 * X[:, 1] - X[:, 2] + 5 + np.random.normal(0, 0.1, 100)

weights, bias = multiple_linear_regression(X, y)
print(f"Weights: {weights}")
print(f"Bias: {bias}")

Visualizing Feature Importance

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate data
np.random.seed(42)
n_samples = 100

# Features: floor area, number of rooms, distance to center, floor number
area = np.random.uniform(50, 200, n_samples)
rooms = np.random.uniform(1, 5, n_samples)
distance = np.random.uniform(1, 20, n_samples)
floor = np.random.uniform(1, 30, n_samples)

# Underlying relationship
price = (0.5 * area + 10 * rooms - 2 * distance + 0.5 * floor +
         50 + np.random.normal(0, 5, n_samples))

# Create and fit the model
X = np.column_stack([area, rooms, distance, floor])
feature_names = ['Area', 'Rooms', 'Distance', 'Floor']
model = LinearRegression()
model.fit(X, price)

# Create the figure
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot each feature against the price
for idx, (ax, feature, name) in enumerate(zip(axes.flat, X.T, feature_names)):
    ax.scatter(feature, price, alpha=0.5, s=30, edgecolors='black', linewidth=0.5)

    # Dashed line: a fit on this feature alone. Its slope generally differs
    # from the weight of the same feature in the multivariate model.
    slope, intercept = np.polyfit(feature, price, 1)
    x_line = np.linspace(feature.min(), feature.max(), 100)
    ax.plot(x_line, slope * x_line + intercept, "r--", linewidth=2,
            label=f'single-feature slope: {slope:.2f} '
                  f'(multivariate weight: {model.coef_[idx]:.2f})')

    ax.set_xlabel(f'{name}', fontsize=12)
    ax.set_ylabel('Price (10,000 CNY)', fontsize=12)
    ax.set_title(f'{name} vs. price', fontsize=13, weight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle('Multiple linear regression: each feature vs. price', fontsize=16, weight='bold')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

# Bar chart of the weights.
# Note: raw weights depend on each feature's units, so compare their magnitudes
# as "importance" only after standardizing the features.
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
colors = ['green' if w > 0 else 'red' for w in model.coef_]
bars = ax.barh(feature_names, model.coef_, color=colors, alpha=0.7, edgecolor='black')
ax.axvline(x=0, color='black', linestyle='--', linewidth=1)
ax.set_xlabel('Weight', fontsize=12)
ax.set_title('Feature weights (positive: price rises with the feature; negative: price falls)',
             fontsize=14, weight='bold')
ax.grid(axis='x', alpha=0.3)

# Add value labels next to the bars
for i, (bar, w) in enumerate(zip(bars, model.coef_)):
    ax.text(w + (0.1 if w > 0 else -0.1), i, f'{w:.2f}',
            va='center', ha='left' if w > 0 else 'right', fontsize=11, weight='bold')

plt.tight_layout()
plt.savefig('feature_weights.png', dpi=300, bbox_inches='tight')
plt.show()

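Part 4: Evaluation Metrics for Linear Regression

Common Evaluation Metrics

1. Mean Squared Error (MSE)

$$
MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
$$

Properties:

Lower is better
Sensitive to large errors (because each error is squared)

2. Root Mean Squared Error (RMSE)

$$
RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}
$$

Properties:

Has the same unit as the target value
Easier to interpret than MSE

3. Mean Absolute Error (MAE)

$$
MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|
$$

Properties:

Less sensitive to outliers than MSE and RMSE (errors are not squared)
Easy to understand: the average size of an error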
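
The difference in outlier sensitivity can be seen on a tiny hand-made example. This is only a sketch with invented numbers: five predictions that are each off by 1, and then the same predictions with a single error of 10.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# Case 1: every prediction is off by exactly 1
y_pred_clean = y_true + 1.0

# Case 2: same as case 1, but the last prediction is off by 10
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = y_true[-1] + 10.0

print(mse(y_true, y_pred_clean), mae(y_true, y_pred_clean))      # 1.0 1.0
print(mse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))  # MSE ≈ 20.8, MAE ≈ 2.8
print(np.sqrt(mse(y_true, y_pred_outlier)))                      # RMSE, in the target's unit
```

One large error multiplies the MSE by roughly 20 here, while the MAE less than triples.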

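4. R² Score (Coefficient of Determination)

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
$$

Properties:

Range: at most 1. For least squares with an intercept, evaluated on the training data, it lies in [0, 1]; on new data it can be negative
1 means a perfect fit
0 means the model does no better than always predicting the mean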
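
The three reference points of R² (perfect fit, mean prediction, worse than the mean) can be reproduced with scikit-learn's `r2_score`. The arrays below are invented for illustration; the last case shows that R² is not bounded below by 0 in general.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect predictions
r2_perfect = r2_score(y_true, y_true)

# Always predicting the mean of y_true
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Predictions in reversed order: worse than predicting the mean
r2_bad = r2_score(y_true, y_true[::-1])

print(r2_perfect, r2_mean, r2_bad)  # 1.0 0.0 -3.0
```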

Visualization: Comparing the Evaluation Metrics

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Generate data
np.random.seed(42)
X = np.linspace(0, 10, 50)
y = 2 * X + 3 + np.random.normal(0, 2, len(X))

# Fit the model
model = LinearRegression()
model.fit(X.reshape(-1, 1), y)
y_pred = model.predict(X.reshape(-1, 1))

# Compute the evaluation metrics
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y, y_pred)
r2 = r2_score(y, y_pred)

# Create the figure
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Predicted vs. true values
ax1 = axes[0, 0]
ax1.scatter(y, y_pred, alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
ax1.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', linewidth=2, label='Perfect prediction')
ax1.set_xlabel('True value', fontsize=12)
ax1.set_ylabel('Predicted value', fontsize=12)
ax1.set_title(f'Predicted vs. true (R² = {r2:.4f})', fontsize=13, weight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Residual plot
ax2 = axes[0, 1]
residuals = y - y_pred
ax2.scatter(y_pred, residuals, alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
ax2.axhline(y=0, color='r', linestyle='--', linewidth=2)
ax2.set_xlabel('Predicted value', fontsize=12)
ax2.set_ylabel('Residual (true - predicted)', fontsize=12)
ax2.set_title('Residual plot (ideal: random scatter around 0)', fontsize=13, weight='bold')
ax2.grid(True, alpha=0.3)

# 3. Error metrics side by side (note: MSE is in squared units, RMSE and MAE are not)
ax3 = axes[1, 0]
metrics = ['MSE', 'RMSE', 'MAE']
values = [mse, rmse, mae]
colors_bar = ['skyblue', 'lightgreen', 'lightcoral']
bars = ax3.bar(metrics, values, color=colors_bar, alpha=0.7, edgecolor='black', linewidth=1.5)
ax3.set_ylabel('Error', fontsize=12)
ax3.set_title('Error metrics (lower is better)', fontsize=13, weight='bold')
ax3.grid(axis='y', alpha=0.3)

# Add value labels above the bars
for bar, val in zip(bars, values):
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height,
             f'{val:.2f}', ha='center', va='bottom', fontsize=11, weight='bold')

# 4. R² as a share of the total sum of squares
ax4 = axes[1, 1]
# Decompose the total sum of squares
ss_res = np.sum((y - y_pred)**2)
ss_tot = np.sum((y - np.mean(y))**2)
ss_reg = ss_tot - ss_res

# Draw the pie chart
sizes = [ss_reg, ss_res]
labels = [f'Explained variation\n({ss_reg:.2f})', f'Unexplained variation\n({ss_res:.2f})']
colors_pie = ['lightgreen', 'lightcoral']
explode = (0.05, 0.05)

ax4.pie(sizes, explode=explode, labels=labels, colors=colors_pie,
        autopct='%1.1f%%', shadow=True, startangle=90)
ax4.set_title(f'R² score = {r2:.4f}\n(share of variation explained)', fontsize=13, weight='bold')

plt.suptitle('Linear regression evaluation metrics', fontsize=16, weight='bold')
plt.tight_layout()
plt.savefig('regression_metrics.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R²: {r2:.4f}")

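Part 5: Where Linear Regression Is Used

1. Prediction

Typical uses:

House price prediction: predict the price from area, location, number of rooms, and so on
Sales forecasting: predict sales from advertising spend, season, promotions, and so on
Stock prediction: predict a price from historical data (only as a simple baseline model)
Demand forecasting: predict future demand from historical data

Example: sales forecasting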

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate example data: advertising spend vs. sales
np.random.seed(42)
advertising = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
# Underlying relationship: sales = 5 * advertising + 100 + noise
sales = 5 * advertising + 100 + np.random.normal(0, 10, len(advertising))

# Fit the model
model = LinearRegression()
model.fit(advertising.reshape(-1, 1), sales)

# Predict for larger budgets. These inputs lie outside the observed range
# (10 to 100), so this is extrapolation and assumes the linear trend continues.
future_ads = np.array([110, 120, 130])
future_sales = model.predict(future_ads.reshape(-1, 1))

# Visualization
fig, ax = plt.subplots(1, 1, figsize=(12, 7))

# Plot the historical data
ax.scatter(advertising, sales, color='blue', s=100, alpha=0.6,
           label='Historical data', zorder=5, edgecolors='black', linewidth=1.5)

# Plot the fitted line
x_line = np.linspace(0, 140, 100)
y_line = model.predict(x_line.reshape(-1, 1))
ax.plot(x_line, y_line, 'r-', linewidth=3, label='Fitted line', zorder=3)

# Plot the predictions
ax.scatter(future_ads, future_sales, color='green', s=150, marker='*',
           label='Predictions', zorder=6, edgecolors='black', linewidth=2)

# Drop a guide line from each prediction to the x-axis and label its value
for ad, sale in zip(future_ads, future_sales):
    ax.plot([ad, ad], [0, sale], 'g--', alpha=0.5, linewidth=1)
    ax.text(ad, sale + 10, f'{sale:.0f}', ha='center', fontsize=10,
            weight='bold', bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))

ax.set_xlabel('Advertising spend (10,000 CNY)', fontsize=13, weight='bold')
ax.set_ylabel('Sales (units)', fontsize=13, weight='bold')
ax.set_title('Sales forecast: advertising spend vs. sales', fontsize=15, weight='bold')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)

# Add a text box with the fitted equation and the predictions
info_text = f'Fitted equation: sales = {model.coef_[0]:.2f} × advertising + {model.intercept_:.2f}\n\n'
info_text += 'Predictions:\n'
for ad, sale in zip(future_ads, future_sales):
    info_text += f'  Advertising {ad} (10,000 CNY) → predicted sales {sale:.0f} units\n'

ax.text(0.02, 0.98, info_text, transform=ax.transAxes,
        fontsize=11, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.savefig('sales_prediction.png', dpi=300, bbox_inches='tight')
plt.show()

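2. Relationship Analysis

Typical uses:

Factor analysis: which factors affect the result the most?
Correlation analysis: is there a linear relationship between two variables?
Trend analysis: in which direction is the data moving, and how fast?

3. Anomaly Detection

Typical uses:

Data quality checks: identify data points that deviate strongly from the predicted value
Outlier detection: find observations that do not follow the overall pattern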
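
One simple way to use a fitted line for anomaly detection is to flag points whose residual is unusually large. The sketch below is an assumption-laden illustration: the data is synthetic, one anomaly is injected by hand at index 7, and the cut-off of 3 standard deviations is a common rule of thumb rather than a fixed standard.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data around y = 3x + 5, with one hand-made anomaly
rng = np.random.default_rng(0)
x = np.arange(1.0, 21.0)
y = 3.0 * x + 5.0 + rng.normal(0.0, 1.0, size=x.size)
y[7] += 25.0  # inject an anomalous observation at index 7

# Fit the line and compute residuals (true - predicted)
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Standardize the residuals and flag those more than 3 standard deviations from the mean
z_scores = (residuals - residuals.mean()) / residuals.std()
outlier_idx = np.flatnonzero(np.abs(z_scores) > 3.0)

print("Flagged indices:", outlier_idx)
```

Because the anomaly itself pulls the fitted line and inflates the residual spread, this approach works best when anomalies are rare; robust fitting is an option when they are not.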

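Part 6: The Role of Linear Regression in Machine Learning

Where Linear Regression Stands in Machine Learning

1. A foundational algorithm

Linear regression is one of the basic algorithms of machine learning
It is a building block of more complex models (for example, each layer of a neural network applies a linear map before its nonlinearity)
Understanding linear regression makes other algorithms easier to understand

2. A baseline model

It serves as a reference point when comparing other models
If linear regression already performs well, the problem may be fairly simple
If linear regression performs poorly, a more complex model may be needed

3. Highly interpretable

Each weight has a direct meaning: the expected change in the output per unit change in that feature, with the other features held fixed
Easy to understand and explain
Well suited to settings where predictions must be explained

Using Linear Regression in Scikit-learn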

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error

# Load data (example: the California housing dataset, downloaded on first use)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize the features. For plain least squares this does not change the
# predictions, but it makes the weights comparable across features.
# The scaler is fitted on the training set only to avoid leaking test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create the model
model = LinearRegression()

# Fit the model
model.fit(X_train_scaled, y_train)

# Predict
y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)

# Evaluate
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

print(f"Training R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")
print(f"Test MSE: {test_mse:.4f}")
print("\nFeature weights:")
for name, weight in zip(housing.feature_names, model.coef_):
    print(f"  {name}: {weight:.4f}")
print(f"Bias: {model.intercept_:.4f}")

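Limitations of Linear Regression

1. The linearity assumption

The model assumes a linear relationship between inputs and output
It performs poorly when the true relationship is nonlinear

2. Sensitivity to outliers

Outliers can change the fitted line noticeably, because errors are squared
Robust regression methods can reduce this effect

3. Multicollinearity

When features are highly correlated with each other, the weights become unstable
Regularization can stabilize them

Ways to Improve It

1. Polynomial regression

Apply a polynomial transformation to the features
This lets a linear model fit nonlinear relationships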
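
A minimal sketch of polynomial regression with scikit-learn follows. The quadratic data, the seed, and the choice of degree 2 are assumptions made for the example; in practice the degree would be chosen by validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship: y = 0.5x² - x + 2 + noise
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 60)
y = 0.5 * x**2 - x + 2.0 + rng.normal(0.0, 0.2, size=x.size)
X = x.reshape(-1, 1)

# Plain linear regression on x
linear = LinearRegression().fit(X, y)

# Linear regression on the expanded features [x, x²]
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, y)

r2_linear = linear.score(X, y)
r2_poly = poly.score(X, y)
print(f"R² (linear): {r2_linear:.3f}, R² (degree 2): {r2_poly:.3f}")
```

The model is still linear in its weights; only the features are transformed.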

2. Regularized regression

Ridge regression: L2 regularization
Lasso regression: L1 regularization
Elastic Net: a combination of L1 and L2 regularization
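
The effect of L2 regularization under multicollinearity can be sketched as follows. The two nearly identical features, the seed, and `alpha=1.0` are illustrative assumptions. Ordinary least squares is free to split the effect between the two features in an unstable way, while Ridge shrinks the weight vector.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two almost perfectly correlated features; only their shared signal drives y
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha is the L2 penalty strength

print("OLS weights:  ", ols.coef_)
print("Ridge weights:", ridge.coef_)
print("Ridge weight sum:", ridge.coef_.sum())  # close to the true combined effect of 3
```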

3. Robust regression

Less sensitive to outliers
Uses a loss that grows more slowly for large errors, such as the Huber loss
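
As a rough sketch of the idea, scikit-learn's `HuberRegressor` can be compared with ordinary least squares on data where a few observations are corrupted. The data, the seed, and the size of the corruption are invented for the example, and `max_iter=1000` is only there to give the optimizer room to converge.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic data around y = 2x + 3
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 3.0 + rng.normal(0.0, 0.5, size=x.size)
y[-5:] += 40.0  # corrupt the last five observations

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
huber = HuberRegressor(max_iter=1000).fit(X, y)

ols_slope = ols.coef_[0]
huber_slope = huber.coef_[0]
print(f"True slope: 2.00, OLS slope: {ols_slope:.2f}, Huber slope: {huber_slope:.2f}")
```

The Huber fit is expected to stay closer to the slope of the uncorrupted data, because large residuals contribute linearly instead of quadratically to its loss.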

Part 7: A Worked Case: A Complete Linear Regression Project

Case: Predicting Student Scores

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Generate simulated data
np.random.seed(42)
n_students = 200

data = {
    'study_hours': np.random.uniform(1, 10, n_students),
    'homework_completed': np.random.uniform(0, 100, n_students),
    'attendance': np.random.uniform(70, 100, n_students),
    'previous_score': np.random.uniform(60, 90, n_students)
}

# Underlying relationship:
# score = 5*study_hours + 0.3*homework_completed + 0.2*attendance
#         + 0.4*previous_score + 20 + noise
df = pd.DataFrame(data)
df['score'] = (5 * df['study_hours'] +
               0.3 * df['homework_completed'] +
               0.2 * df['attendance'] +
               0.4 * df['previous_score'] +
               20 +
               np.random.normal(0, 5, n_students))

# Prepare features and target
X = df[['study_hours', 'homework_completed', 'attendance', 'previous_score']]
y = df['score']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Evaluate
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_mae = mean_absolute_error(y_test, y_test_pred)

# Build a combined figure
fig = plt.figure(figsize=(18, 12))

# 1. Each feature against the target (subplots 1 to 4)
for idx, feature in enumerate(X.columns, 1):
    ax = fig.add_subplot(3, 3, idx)
    ax.scatter(X_train[feature], y_train, alpha=0.5, s=30, label='Training set')
    ax.scatter(X_test[feature], y_test, alpha=0.5, s=30, label='Test set', marker='x')
    ax.set_xlabel(feature, fontsize=11)
    ax.set_ylabel('Score', fontsize=11)
    ax.set_title(f'{feature} vs. score', fontsize=12, weight='bold')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

# 2. Predicted vs. true values (training set)
ax = fig.add_subplot(3, 3, 5)
ax.scatter(y_train, y_train_pred, alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
ax.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()],
        'r--', linewidth=2)
ax.set_xlabel('True score', fontsize=11)
ax.set_ylabel('Predicted score', fontsize=11)
ax.set_title(f'Training set: predicted vs. true (R² = {train_r2:.4f})', fontsize=12, weight='bold')
ax.grid(True, alpha=0.3)

# 3. Predicted vs. true values (test set)
ax = fig.add_subplot(3, 3, 6)
ax.scatter(y_test, y_test_pred, alpha=0.6, s=50, edgecolors='black', linewidth=0.5, color='green')
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
        'r--', linewidth=2)
ax.set_xlabel('True score', fontsize=11)
ax.set_ylabel('Predicted score', fontsize=11)
ax.set_title(f'Test set: predicted vs. true (R² = {test_r2:.4f})', fontsize=12, weight='bold')
ax.grid(True, alpha=0.3)

# 4. Residual plot (test set)
ax = fig.add_subplot(3, 3, 7)
residuals = y_test - y_test_pred
ax.scatter(y_test_pred, residuals, alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
ax.axhline(y=0, color='r', linestyle='--', linewidth=2)
ax.set_xlabel('Predicted score', fontsize=11)
ax.set_ylabel('Residual', fontsize=11)
ax.set_title('Residual plot', fontsize=12, weight='bold')
ax.grid(True, alpha=0.3)

# 5. Feature weights
ax = fig.add_subplot(3, 3, 8)
colors = ['green' if w > 0 else 'red' for w in model.coef_]
bars = ax.barh(X.columns, model.coef_, color=colors, alpha=0.7, edgecolor='black')
ax.axvline(x=0, color='black', linestyle='--', linewidth=1)
ax.set_xlabel('Weight', fontsize=11)
ax.set_title('Feature weights', fontsize=12, weight='bold')
ax.grid(axis='x', alpha=0.3)
for bar, w in zip(bars, model.coef_):
    ax.text(w + (0.1 if w > 0 else -0.1), bar.get_y() + bar.get_height()/2,
            f'{w:.2f}', va='center', ha='left' if w > 0 else 'right',
            fontsize=10, weight='bold')

# 6. Evaluation metrics (note: R² is unitless, RMSE and MAE are in score points)
ax = fig.add_subplot(3, 3, 9)
metrics = ['R²', 'RMSE', 'MAE']
values = [test_r2, test_rmse, test_mae]
colors_bar = ['skyblue', 'lightgreen', 'lightcoral']
bars = ax.bar(metrics, values, color=colors_bar, alpha=0.7, edgecolor='black', linewidth=1.5)
ax.set_ylabel('Value', fontsize=11)
ax.set_title('Evaluation metrics (test set)', fontsize=12, weight='bold')
ax.grid(axis='y', alpha=0.3)
for bar, val in zip(bars, values):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{val:.2f}', ha='center', va='bottom', fontsize=10, weight='bold')

plt.suptitle('Predicting student scores: a complete linear regression analysis', fontsize=16, weight='bold')
plt.tight_layout()
plt.savefig('complete_regression_project.png', dpi=300, bbox_inches='tight')
plt.show()

# Print the results
print("="*60)
print("Linear regression results")
print("="*60)
print("\nFitted equation:")
print(f"score = {model.coef_[0]:.2f}×study_hours + {model.coef_[1]:.2f}×homework_completed + "
      f"{model.coef_[2]:.2f}×attendance + {model.coef_[3]:.2f}×previous_score + {model.intercept_:.2f}")
print("\nEvaluation metrics:")
print(f"  Training R²: {train_r2:.4f}")
print(f"  Test R²: {test_r2:.4f}")
print(f"  Test RMSE: {test_rmse:.2f}")
print(f"  Test MAE: {test_mae:.2f}")
print("\nFeature weights:")
for name, weight in zip(X.columns, model.coef_):
    print(f"  {name}: {weight:.4f}")
print(f"Bias: {model.intercept_:.4f}")
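Part 8: Summary and Best Practices

Key Points About Linear Regression

Simple but powerful: despite its simplicity, linear regression works well in many settings
Highly interpretable: the weights and the bias have a direct meaning
Computationally cheap: both training and prediction are fast
A baseline: it gives a reference point for comparing other models

Steps for Using Linear Regression

Data preparation
- Collect data
- Clean the data
- Select features
Model training
- Split into training and test sets
- Fit the model
- Tune hyperparameters (for example, the regularization strength, if regularization is used)
Model evaluation
- Compute evaluation metrics
- Analyze the residuals
- Check the assumptions
Model use
- Make predictions
- Interpret the results
- Monitor over time

Best Practices

Data preprocessing
- Handle missing values
- Standardize the features
- Handle outliers
Feature engineering
- Select relevant features
- Deal with multicollinearity
- Create new features (such as polynomial features)
Model validation
- Use cross-validation
- Check for overfitting
- Analyze the residuals
Model improvement
- Try regularization
- Use polynomial regression
- Consider nonlinear models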
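
The preprocessing and validation practices above can be combined in a few lines with a scikit-learn pipeline. This is a sketch on synthetic data; the number of folds, the seed, and the data-generating weights are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features with known weights plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] - X[:, 2] + 5.0 + rng.normal(0.0, 0.5, size=200)

# Putting the scaler inside the pipeline means it is re-fitted on each
# training fold, so no information from the validation fold leaks in
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# 5-fold cross-validation, scored with R²
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="r2")

print("R² per fold:", np.round(scores, 3))
print(f"Mean R²: {scores.mean():.3f}")
```

A large gap between the training score and the cross-validated score is one sign of overfitting.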

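Common Problems and Solutions

Problem 1: The linearity assumption does not hold

Solution: use polynomial regression or a nonlinear model

Problem 2: The features are multicollinear

Solution: use Ridge or Lasso regression

Problem 3: The data contains outliers

Solution: use a robust regression method

Problem 4: There are many features

Solution: use Lasso regression for feature selection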
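
A sketch of Lasso-based feature selection: with an L1 penalty, the weights of uninformative features are driven exactly to zero. The data below is synthetic, with only the first two of ten features affecting the target, and `alpha=0.2` is an arbitrary illustrative choice that would normally be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Ten features, but only the first two influence y
rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
X = rng.normal(size=(n_samples, n_features))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0.0, 0.5, size=n_samples)

# Standardize so the L1 penalty treats all features on the same scale
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.2).fit(X_scaled, y)

# Features whose weight was not shrunk to exactly zero
selected = np.flatnonzero(lasso.coef_ != 0)
print("Weights:", np.round(lasso.coef_, 2))
print("Selected feature indices:", selected)
```

The surviving weights are biased toward zero by the penalty, so a common follow-up is to refit an unpenalized model on the selected features.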

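Closing Remarks

Linear regression is one of the most basic and most important algorithms in machine learning:

Easy to understand: the concept is intuitive
Widely applicable: it suits many prediction and analysis tasks
Foundational: it is the basis for understanding more complex algorithms
Practical: it is used often in real projects

After reading this article, you should be able to:

Understand the basic principles of linear regression
Work with both simple and multiple linear regression
Evaluate a linear regression model
Apply linear regression in a real project

Remember: linear regression is simple, but very practical. In many settings, a simple linear regression can perform better than a more complex model.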

