In data science, linear regression is a fundamental yet powerful statistical method for modeling the relationship between one or more independent variables and a dependent variable. Whether the task is house price prediction, stock price analysis, or user behavior research, linear regression plays an important role. This article starts from the basic concepts and works up to practical applications, helping you master linear regression in Python.
A simple linear regression model takes the form \( y = \beta_0 + \beta_1 x + \epsilon \), where \( \epsilon \) is the error term, representing the variation the model cannot explain.
Fitting such a model with scikit-learn follows a few standard steps.

1. Import the required libraries:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
```

2. Prepare the data:
```python
# Generate some sample data
X = np.random.rand(100, 1)
y = 2 * X.squeeze() + 1 + np.random.randn(100) * 0.1
```

3. Train the model:
```python
# Create a linear regression model
model = LinearRegression()
# Fit the model
model.fit(X, y)
```

4. Evaluate the model:
```python
# Predict on the training data
y_pred = model.predict(X)
# Compute the mean squared error
mse = mean_squared_error(y, y_pred)
print(f"Mean Squared Error: {mse}")
```
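It can also help to check the fitted parameters against the data-generating process. The short check below is an optional addition, not part of the original walkthrough; with the synthetic data above, the intercept should come out close to 1 and the slope close to 2:

```python
# Inspect the fitted parameters (optional check)
print(f"Intercept (beta_0): {model.intercept_:.3f}")   # expected to be near 1
print(f"Coefficient (beta_1): {model.coef_[0]:.3f}")   # expected to be near 2
```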
Basic Example

The example below simulates house prices as a linear function of floor area, fits a model on a training split, and visualizes the predictions on the test split:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate sample data
np.random.seed(0)
X = np.random.rand(100, 1) * 100  # house area (square meters)
y = 2 * X.squeeze() + 100 + np.random.randn(100) * 10  # house price (10k RMB)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Visualize the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('House Area (sqm)')
plt.ylabel('House Price (10k RMB)')
plt.legend()
plt.show()
```

Advanced Example
In practice, the relationship between features and target is rarely perfectly linear. When the data follows a curved pattern, how can polynomial regression improve the model's performance?
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate sample data with a quadratic relationship
np.random.seed(0)
X = np.random.rand(100, 1) * 10
y = X.squeeze() ** 2 + 2 * X.squeeze() + 1 + np.random.randn(100) * 0.1

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create degree-2 polynomial features
poly_features = PolynomialFeatures(degree=2)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Create a linear regression model
model = LinearRegression()

# Fit the model on the polynomial features
model.fit(X_train_poly, y_train)

# Predict on the test set
y_pred = model.predict(X_test_poly)

# Compute the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Visualize the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('Feature X')
plt.ylabel('Target Y')
plt.legend()
plt.show()
```

Practical Case
The following case applies the same workflow to a housing dataset loaded from a CSV file (house_prices.csv) with area, bedrooms, bathrooms, and price columns:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the data
data = pd.read_csv('house_prices.csv')

# Preprocess: drop rows with missing values
data.dropna(inplace=True)
X = data[['area', 'bedrooms', 'bathrooms']]
y = data['price']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Visualize actual vs. predicted prices
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs Predicted House Prices')
plt.show()
```

Extended Discussion
With several features, the model generalizes to multiple linear regression:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon \]

Regularization
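A natural extension of the workflow above is regularized regression: Ridge adds an L2 penalty on the coefficients and Lasso adds an L1 penalty, which helps when features are correlated or the model overfits. The sketch below is illustrative rather than from the original text; the alpha values are arbitrary and would normally be tuned, for example with cross-validation:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data with three features (illustrative only)
np.random.seed(0)
X = np.random.rand(100, 3)
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + np.random.randn(100) * 0.1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Ridge: L2 penalty; alpha controls the penalty strength
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print("Ridge MSE:", mean_squared_error(y_test, ridge.predict(X_test)))

# Lasso: L1 penalty; can shrink some coefficients exactly to zero
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print("Lasso MSE:", mean_squared_error(y_test, lasso.predict(X_test)))
```

Both estimators follow the same fit/predict interface as LinearRegression, so they can be dropped into any of the earlier examples.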
Although linear regression is powerful, it may not be flexible enough in some situations. In those cases, consider other regression methods such as decision tree regression, random forest regression, or support vector regression; a short sketch follows.
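As an illustrative sketch (not from the original text), all three alternatives share scikit-learn's fit/predict interface, so switching estimators only requires changing one line; the hyperparameters shown are arbitrary starting points:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# A simple nonlinear dataset (illustrative)
np.random.seed(0)
X = np.random.rand(200, 1) * 10
y = np.sin(X.squeeze()) + np.random.randn(200) * 0.1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each estimator is created, fitted, and evaluated the same way
models = {
    "Decision Tree": DecisionTreeRegressor(max_depth=5),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "SVR (RBF kernel)": SVR(kernel='rbf', C=1.0),
}

for name, reg in models.items():
    reg.fit(X_train, y_train)
    mse = mean_squared_error(y_test, reg.predict(X_test))
    print(f"{name} MSE: {mse:.4f}")
```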