4  FAQ - Linear Regression

Q: When should I use linear regression?

A: Use linear regression when you suspect a linear relationship between the independent and dependent variables. It’s ideal for predicting a continuous variable, understanding relationships between variables, or forecasting trends.
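A minimal sketch of that first use case, fitting on synthetic data (the numbers and the use of scikit-learn are our assumptions, not a prescription):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is roughly linear in x, plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(scale=1.0, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # should land near [3.0] and 5.0
print(model.predict([[4.0]]))         # predicting a continuous value
```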

Q: How do I choose the right variables for my linear regression model?

A: Select variables based on theoretical understanding, previous research, or exploratory data analysis. Use scatter plots, correlation coefficients, and domain knowledge to identify relevant predictors. Avoid variables with little to no variation, irrelevant features, or those highly correlated with other predictors (to prevent multicollinearity).
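A sketch of that exploratory screening with pandas (the DataFrame and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sqft": rng.uniform(50, 300, size=200),
    "rooms": rng.integers(1, 8, size=200).astype(float),
    "flat_fee": np.full(200, 49.99),  # little to no variation: a candidate to drop
})
df["price"] = 2000 * df["sqft"] + 15000 * df["rooms"] + rng.normal(scale=5e4, size=200)

# Correlation of each candidate predictor with the target
print(df.corr(numeric_only=True)["price"].sort_values(ascending=False))

# Near-zero standard deviation flags near-constant predictors
print(df.drop(columns="price").std())
```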

Q: What is multicollinearity, and why is it a problem?

A: Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, leading to unreliable and unstable parameter estimates. It makes it difficult to isolate the individual effect of each predictor on the dependent variable.

Q: How do I check for and address multicollinearity?

A: Check for multicollinearity using Variance Inflation Factor (VIF) scores. A VIF above 5 (some practitioners use a more lenient cutoff of 10) is commonly taken to indicate a multicollinearity problem. Address it by removing or combining correlated variables, or by using regularisation techniques like Ridge or Lasso regression.
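A sketch of the VIF check using statsmodels, on deliberately collinear synthetic data:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# One VIF per predictor; the constant term is skipped
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 1))
# Expect large VIFs for x1 and x2, and a value near 1 for x3
```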

Q: Why do I need to split my data into training and testing sets?

A: Splitting the data into training and testing sets lets you train the model on one subset and then measure its performance on an unseen subset. This evaluates the model’s ability to generalise to new data and detects overfitting, which a score on the training data alone would hide.
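With scikit-learn (a sketch on synthetic data), the split and the train/test comparison look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)

# Hold out 20% of the rows; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))
print("test R^2: ", model.score(X_test, y_test))  # performance on unseen data
```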

Q: What are Ridge, Lasso, and ElasticNet regression?

A: Ridge and Lasso are types of regularised linear regression that penalise large coefficients to reduce overfitting and handle multicollinearity. Ridge regression penalises the sum of squared coefficients (an L2 penalty), while Lasso penalises the sum of their absolute values (an L1 penalty), potentially shrinking some coefficients to exactly zero. ElasticNet combines both penalties, offering a compromise between Ridge and Lasso.
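A sketch comparing the three in scikit-learn (the alpha values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=200)

for model in (
    Ridge(alpha=1.0),                     # L2 penalty: shrinks all coefficients
    Lasso(alpha=0.1),                     # L1 penalty: can zero some out entirely
    ElasticNet(alpha=0.1, l1_ratio=0.5),  # a mix of both penalties
):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))
```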

Q: How do I interpret the coefficients in a linear regression model?

A: Each coefficient represents the change in the dependent variable for a one-unit change in the corresponding independent variable, holding all other variables constant. Positive coefficients indicate a positive relationship, and negative coefficients indicate a negative relationship between the predictor and the dependent variable.
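For example (a sketch with invented housing-style features), a fitted coefficient of roughly 2000 on `sqft` means the model predicts about 2000 more in the target per extra unit of `sqft`, with `rooms` held constant:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
sqft = rng.uniform(50, 300, size=300)
rooms = rng.integers(1, 8, size=300).astype(float)
X = np.column_stack([sqft, rooms])
y = 2000 * sqft + 15000 * rooms + rng.normal(scale=2e4, size=300)

model = LinearRegression().fit(X, y)
for name, coef in zip(["sqft", "rooms"], model.coef_):
    # Estimated change in y for a one-unit increase in this feature,
    # holding the other feature constant
    print(f"{name}: {coef:.0f}")
```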

Q: How can I improve the performance of my linear regression model?

A: To improve your model, consider feature engineering, removing outliers, transforming skewed variables, or using regularisation techniques. Also, ensure your data meets the assumptions of linear regression, such as linearity, homoscedasticity, and normal distribution of residuals.
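As one example of such a fix (a sketch assuming a right-skewed predictor), a log transform can make the relationship closer to linear:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(9)
income = rng.lognormal(mean=10, sigma=1.0, size=400)        # heavily right-skewed
y = 5.0 * np.log(income) + rng.normal(scale=0.5, size=400)

raw = LinearRegression().fit(income.reshape(-1, 1), y)
logged = LinearRegression().fit(np.log1p(income).reshape(-1, 1), y)

print("raw R^2:   ", raw.score(income.reshape(-1, 1), y))
print("logged R^2:", logged.score(np.log1p(income).reshape(-1, 1), y))
# The log-transformed fit should score noticeably higher on this data
```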

Q: Why can’t I get an R-squared score of 100%?

A: An R-squared of 100% is practically unattainable on real-world data because of noise, outliers, and inherent randomness. A model that fits the training data perfectly is also likely to be overfitted, meaning it has learned the noise in the training data so thoroughly that it performs poorly on unseen data.
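A sketch of that overfitting effect (a high-degree polynomial fitted to a handful of noisy points; exact numbers will vary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(11)
X = np.sort(rng.uniform(0, 1, size=15)).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=15)
X_new = rng.uniform(0, 1, size=100).reshape(-1, 1)
y_new = np.sin(2 * np.pi * X_new[:, 0]) + rng.normal(scale=0.2, size=100)

# A degree-14 polynomial can interpolate 15 points almost perfectly...
model = make_pipeline(PolynomialFeatures(degree=14), LinearRegression()).fit(X, y)
print("train R^2: ", model.score(X, y))          # close to 1.0
print("unseen R^2:", model.score(X_new, y_new))  # typically far worse, often negative
```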

Q: What score should I aim for?

A: The ideal score depends on the context of the problem and the complexity of the data. In many real-world scenarios, an R-squared score of 70% to 90% is considered good, but this can vary. For some complex problems, even lower scores might be acceptable. It’s crucial to compare your model’s performance to baseline models or previous research in the same area to set realistic expectations. Aim for a balance between a model that fits the data well and one that maintains the ability to generalise to new, unseen data.