3  Steps to create a linear regression model

There is a LOT of information in the previous chapter on how to create a linear regression model, and it might feel a little overwhelming! The steps below outline, in order, each of the stages you will need to complete to create a model and then improve it.

3.1 Import the necessary packages

You can’t do anything without these! You will need:

# Data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Modelling, evaluation and preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Statistical tests for checking the regression assumptions
import statsmodels.stats.api as sms
from statsmodels.compat import lzip

# Box-Cox transformation for skewed variables
from scipy.stats import boxcox

3.2 Prepare your data

  • Load in the data and perform an initial examination: .head(), .describe() or .info() are just some of the methods you can use.

  • What data types do you have? How many columns and rows? Are there any missing values? If so, deal with them. Make sure your data is as clean as possible.

  • Choose your target (dependent variable) and features (independent variables). For linear regression, ensure your target is continuous.
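As a minimal sketch of this stage, assuming a hypothetical file data.csv with a continuous price column as the target (swap in your own file name, columns and missing-value strategy):

import pandas as pd

# Load the data and take a first look (file and column names are placeholders)
df = pd.read_csv("data.csv")
print(df.head())
print(df.info())
print(df.describe())

# Check for missing values; dropping rows is just one of several options
print(df.isna().sum())
df = df.dropna()

# Choose a continuous target and the features used to predict it
y = df["price"]
X = df.drop(columns=["price"])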

3.3 Baseline model training

  • Split your data into training and testing sets using train_test_split to evaluate your model’s performance on unseen data.

  • Create a linear regression model using LinearRegression() from scikit-learn.

  • Fit the model to your training data using the .fit() method. These three steps are put together in the sketch below.
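A baseline might look like this (X and y come from the previous step; the 80/20 split and the random_state value are arbitrary choices):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hold back 20% of the data as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create a linear regression model and fit it to the training data
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)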

3.4 Model Evaluation

  • Use the .score() method to calculate the \(R^2\) (coefficient of determination) for both the training and testing sets to evaluate how well your model explains the variation in the target variable.

  • Predict on your testing set using .predict() and evaluate error metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) to understand the average error your model makes in its predictions.
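Continuing the same sketch (lin_reg and the train/test splits come from the previous step):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# R^2 on both sets; a large gap between them suggests overfitting
print("Training R^2:", lin_reg.score(X_train, y_train))
print("Testing R^2:", lin_reg.score(X_test, y_test))

# Error metrics on the test set
y_pred = lin_reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}")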

3.5 Check for Assumptions and Model Improvement

  • Linearity: Confirm that there is a linear relationship between the predictors and the target variable. If necessary, consider transforming your predictors.

  • Multicollinearity: Use Variance Inflation Factor (VIF) to check for multicollinearity among predictors. Remove or combine features that cause high multicollinearity.

  • Homoscedasticity: Ensure the residuals have constant variance across different values of predictors. If heteroscedasticity is present, consider transforming the target variable.

  • Normality of Residuals: Check if the residuals are normally distributed. Non-normal residuals might indicate the need for transforming the target variable or predictors.
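One way to run these checks on the baseline model, assuming the fitted lin_reg and training data from earlier (the variance_inflation_factor import is in addition to section 3.1, and Breusch-Pagan is just one possible test for heteroscedasticity):

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
from statsmodels.stats.outliers_influence import variance_inflation_factor

fitted = lin_reg.predict(X_train)
residuals = y_train - fitted

# Linearity / homoscedasticity: residuals vs fitted values should show no clear pattern
plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Multicollinearity: a VIF above roughly 5-10 is usually a warning sign
# (assumes all features are numeric)
vif = pd.DataFrame({
    "feature": X_train.columns,
    "VIF": [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
})
print(vif)

# Homoscedasticity: Breusch-Pagan test (a small p-value suggests heteroscedasticity)
names = ["Lagrange multiplier statistic", "p-value", "f-value", "f p-value"]
bp_test = sms.het_breuschpagan(residuals, sm.add_constant(X_train))
print(lzip(names, bp_test))

# Normality of residuals: a quick visual check (a Q-Q plot would also work)
plt.hist(residuals, bins=30)
plt.show()

# If these checks fail, transforming the target is one option, e.g. Box-Cox
# (requires a strictly positive target):
# y_train_bc, lam = boxcox(y_train)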

3.6 Feature Engineering

  • Include other features if they make sense for your model.

  • (Optional) Use domain knowledge to create new features that might be relevant to the target variable; a toy example follows below.
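As an illustration only (the total_rooms and households columns are hypothetical; substitute whatever makes sense for your dataset):

# Hypothetical example: a ratio feature can carry more signal than the two raw counts
df["rooms_per_household"] = df["total_rooms"] / df["households"]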

3.7 Scaling and Regularisation

  • Apply feature scaling through methods like Min-Max Scaling or Standardisation so that features measured on very different scales contribute comparably; this matters especially for the regularised models below, whose penalties depend on the size of the coefficients.

  • Experiment with Ridge, Lasso, or ElasticNet regularisation to reduce overfitting and handle multicollinearity by penalising large coefficients.
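A sketch of standardisation followed by a Ridge model (Lasso and ElasticNet follow the same pattern; alpha=1.0 is simply the default). Note that the scaler is fitted on the training set only and then applied to the test set, to avoid leaking information:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Fit the scaler on the training data only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ridge penalises large coefficients; alpha controls the strength of the penalty
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
print("Testing R^2:", ridge.score(X_test_scaled, y_test))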

3.8 Hyperparameter Tuning

  • Adjust the regularisation strength (alpha) and the mix between L1 and L2 penalties (l1_ratio in ElasticNet) using cross-validation to find the optimal model.
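One simple way to do this with the tools already imported is to loop over candidate alpha values and score each with cross_val_score on the (scaled) training set; the candidate grid below is arbitrary, and scikit-learn's GridSearchCV offers a more systematic alternative:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.1, 1, 10, 100]:
    # Default scoring for a regressor is R^2, averaged over 5 folds
    scores = cross_val_score(Ridge(alpha=alpha), X_train_scaled, y_train, cv=5)
    if scores.mean() > best_score:
        best_alpha, best_score = alpha, scores.mean()

print(f"Best alpha: {best_alpha} (mean CV R^2: {best_score:.3f})")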

3.9 Cross-Validation

  • Use cross-validation techniques like K-fold cross-validation to assess the model’s stability and performance across different subsets of the data.
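For example, a 5-fold cross-validation of the tuned Ridge model from the previous step; the mean gives an overall performance estimate and the standard deviation a sense of how stable that performance is across folds:

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

scores = cross_val_score(Ridge(alpha=best_alpha), X_train_scaled, y_train, cv=5)
print("R^2 per fold:", scores)
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")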

3.10 Final Model Selection

  • Choose the model and preprocessing steps that provide the best balance between model complexity, performance on the training set, and generalisation to new data.