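In data science, linear regression is a basic yet powerful statistical method that models the relationship between one or more independent variables and a dependent variable, and uses that relationship for prediction. It plays an important role in house price prediction, stock price analysis, and user behavior research alike. This article starts from the basic concepts and works step by step toward practical applications, to help you use linear regression in Python with confidence.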
In the linear regression model, \( \epsilon \) is the error term: the part of the dependent variable that the model cannot explain.
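To make the error term concrete, here is a minimal sketch on made-up data. It assumes a known line (intercept 1, slope 2) with Gaussian noise added, fits a straight line with `np.polyfit`, and treats the residuals as an estimate of \( \epsilon \). The data, seed, and variable names are illustrative assumptions, not part of the original article.

```python
import numpy as np

# Hypothetical toy data: a known line y = 1 + 2x plus random noise (the error term).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
epsilon = rng.normal(0, 0.1, size=50)  # simulated error term
y = 1.0 + 2.0 * x + epsilon

# Least-squares fit of a straight line; np.polyfit returns [slope, intercept] for deg=1.
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals are what the fitted line leaves unexplained, an estimate of the error term.
residuals = y - (intercept + slope * x)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"residual mean={residuals.mean():.2e}, residual std={residuals.std():.3f}")
```

With an intercept in the model, the residuals average out to roughly zero, and their spread should be close to the noise level used to generate the data.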
1. Import the libraries:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
2. Prepare the data:
# Generate some example data: y = 2x + 1 plus small Gaussian noise
X = np.random.rand(100, 1)
y = 2 * X.squeeze() + 1 + np.random.randn(100) * 0.1
3. Train the model:
# Create the linear regression model
model = LinearRegression()
# Fit the model
model.fit(X, y)
4. Evaluate the model:
# Predict
y_pred = model.predict(X)
# Compute the mean squared error
mse = mean_squared_error(y, y_pred)
print(f"Mean Squared Error: {mse}")
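Beyond the mean squared error, it can help to check what the model actually learned. The sketch below repeats the same synthetic setup (slope 2, intercept 1) and reads the fitted parameters from `coef_` and `intercept_`. The fixed random seed is an addition for repeatability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic setup as the steps above: y = 2x + 1 plus small noise.
np.random.seed(0)  # fixed seed (an assumption) so the run is repeatable
X = np.random.rand(100, 1)
y = 2 * X.squeeze() + 1 + np.random.randn(100) * 0.1

model = LinearRegression().fit(X, y)

# coef_ holds one weight per feature; intercept_ is the constant term.
print(f"slope: {model.coef_[0]:.3f}")
print(f"intercept: {model.intercept_:.3f}")
```

If the fitted slope and intercept land near 2 and 1, the model has recovered the relationship used to generate the data.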
Basic example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Generate example data
np.random.seed(0)
X = np.random.rand(100, 1) * 100 # House area (square meters)
y = 2 * X.squeeze() + 100 + np.random.randn(100) * 10 # House price (10k RMB)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the linear regression model
model = LinearRegression()
# Fit the model on the training set
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Compute the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
# Visualize the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('House Area (sqm)')
plt.ylabel('House Price (10k RMB)')
plt.legend()
plt.show()
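The mean squared error is in squared units, which makes it hard to read directly. As a hedged complement to the example above, this sketch rebuilds the same synthetic house data and reports two further metrics: the root mean squared error (same unit as the price) and the coefficient of determination R² from `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Same synthetic house data as the basic example.
np.random.seed(0)
X = np.random.rand(100, 1) * 100                       # area in square meters
y = 2 * X.squeeze() + 100 + np.random.randn(100) * 10  # price in 10k RMB
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE is the square root of the MSE, expressed in the target's own unit.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# R^2 is the share of the target's variance explained by the model (1.0 is perfect).
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f} (10k RMB)")
print(f"R^2: {r2:.3f}")
```

Since the noise in this data has a standard deviation of 10, an RMSE in that neighborhood suggests the model has captured most of the predictable signal.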
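Advanced example
In practice, data rarely follow a perfectly linear relationship. When the target depends on a feature in a curved way, polynomial regression can improve the fit: the features are expanded into polynomial terms, and an ordinary linear regression is fitted on the expanded features.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Generate example data: y = x^2 + 2x + 1 plus small noise
np.random.seed(0)
X = np.random.rand(100, 1) * 10
# Squeeze X to 1-D before adding noise; adding a (100,) array to a (100, 1) array would broadcast to (100, 100)
y = (X**2 + 2 * X + 1).squeeze() + np.random.randn(100) * 0.1
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create polynomial features (fit on the training set, reuse on the test set)
poly_features = PolynomialFeatures(degree=2)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)
# Create the linear regression model
model = LinearRegression()
# Fit the model
model.fit(X_train_poly, y_train)
# Predict
y_pred = model.predict(X_test_poly)
# Compute the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
# Visualize the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('Feature X')
plt.ylabel('Target Y')
plt.legend()
plt.show()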
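To see how much the polynomial terms help, one option is to fit a plain linear model and a degree-2 polynomial model on the same split and compare their test errors. The sketch below does this on synthetic quadratic data like the example above, with `y` kept one-dimensional. It uses `make_pipeline` so the feature expansion and the regression travel together; that packaging is a design choice of this sketch, not something the original article prescribes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data: y = x^2 + 2x + 1 plus small noise.
np.random.seed(0)
X = np.random.rand(100, 1) * 10
y = (X**2 + 2 * X + 1).squeeze() + np.random.randn(100) * 0.1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: a straight line fitted to curved data.
linear = LinearRegression().fit(X_train, y_train)

# Polynomial model: expand to [1, x, x^2], then fit a linear regression on that.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(X_train, y_train)

mse_linear = mean_squared_error(y_test, linear.predict(X_test))
mse_poly = mean_squared_error(y_test, poly.predict(X_test))
print(f"Linear MSE: {mse_linear:.3f}")
print(f"Polynomial MSE: {mse_poly:.3f}")
```

The pipeline also removes the need to call `transform` on the test set by hand, which is an easy step to forget.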
Practical case
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Load the data
data = pd.read_csv('house_prices.csv')
# Preprocess the data
data.dropna(inplace=True) # Drop rows with missing values
X = data[['area', 'bedrooms', 'bathrooms']]
y = data['price']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features: fit the scaler on the training set only, so no test-set statistics leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Create the linear regression model
model = LinearRegression()
# Fit the model
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Compute the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
# Visualize the results
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs Predicted House Prices')
plt.show()
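A single train/test split gives only one estimate of the error. As a rough sketch, the workflow above can also be evaluated with 5-fold cross-validation. Because `house_prices.csv` is not available here, the code builds a hypothetical stand-in DataFrame with the same column names; the numbers and the pricing formula are invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for house_prices.csv: made-up values, same column names.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "area": rng.uniform(40, 200, n),
    "bedrooms": rng.integers(1, 6, n),
    "bathrooms": rng.integers(1, 4, n),
})
# Invented pricing rule plus noise, only so there is something to fit.
data["price"] = (
    2.0 * data["area"]
    + 10.0 * data["bedrooms"]
    + 5.0 * data["bathrooms"]
    + rng.normal(0, 10, n)
)

X = data[["area", "bedrooms", "bathrooms"]]
y = data["price"]

# The pipeline refits the scaler inside each fold, so held-out rows never influence scaling.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_squared_error")
cv_mse = -scores.mean()  # scikit-learn reports negated MSE, so flip the sign
print(f"5-fold CV MSE: {cv_mse:.2f}")

# After standardization, coefficient magnitudes are comparable across features.
pipeline.fit(X, y)
coefs = dict(zip(X.columns, pipeline.named_steps["linearregression"].coef_))
print(coefs)
```

With real data, replace the synthetic DataFrame with the loaded CSV and keep the rest unchanged.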
Extended discussion
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon \]
Regularization
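Regularization adds a penalty on the size of the coefficients to the least-squares objective, which can reduce overfitting when there are many features. The following sketch compares ordinary least squares with scikit-learn's `Ridge` (L2 penalty) and `Lasso` (L1 penalty). The data are hypothetical (ten features, of which only two matter) and the `alpha` values are arbitrary choices for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Hypothetical data: 10 features, but only the first two actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty can set some coefficients exactly to zero

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print(f"OLS coefficient norm:   {ols_norm:.3f}")
print(f"Ridge coefficient norm: {ridge_norm:.3f}")
print(f"Lasso coefficients set to zero: {lasso_zeros} of {lasso.coef_.size}")
```

In practice `alpha` is usually chosen by cross-validation, for example with `RidgeCV` or `LassoCV`.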
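Linear regression is powerful, but it is not always flexible enough. In those cases, consider other regression methods such as decision tree regression, random forest regression, or support vector regression.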
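As an illustration of when these alternatives pay off, the sketch below fits the methods just mentioned to hypothetical sine-shaped data, where a straight line is a poor match. The hyperparameters (`max_depth=5`, `n_estimators=100`, an RBF kernel with `C=1.0`) are arbitrary starting points, not tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Hypothetical nonlinear data: y follows a sine curve plus small noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "svr": SVR(kernel="rbf", C=1.0),
}

# Fit each model on the same split and record its test-set MSE.
results = {}
for name, estimator in models.items():
    estimator.fit(X_train, y_train)
    results[name] = mean_squared_error(y_test, estimator.predict(X_test))
    print(f"{name}: MSE = {results[name]:.4f}")
```

On data that really is linear, the ranking can reverse, so it is worth comparing against the linear baseline rather than assuming a more flexible model is better.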