What is overfitting? In short, an overfit model has learned the noise in the training data rather than the underlying signal, so it performs well on the training data but generalizes poorly to new data.
What are some ways to overfit the data?
An overly complex model has low bias but high variance.
Question: Are linear regression models high bias/low variance, or low bias/high variance?
Answer: High bias/low variance (generally speaking)
Great! So as long as we don't train and test on the same data, we don't have to worry about overfitting, right? Not so fast...
Linear models can overfit if you include irrelevant features.
Question: Why would that be the case?
Answer: Because the model will learn a coefficient for any feature you feed into it, regardless of whether that feature contains signal or noise.
This is especially a problem when p (number of features) is close to n (number of observations), because that model will naturally have high variance.
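As a sketch of this (the data here is synthetic and the numbers are made up for illustration), we can fit ordinary least squares on data where only one of fifty features carries signal and p is close to n, and compare training performance to performance on unseen data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# 60 training observations, 50 features: only the first carries signal
n_train, n_test, p = 60, 200, 50
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
y_train = 2 * X_train[:, 0] + rng.normal(size=n_train)
y_test = 2 * X_test[:, 0] + rng.normal(size=n_test)

model = LinearRegression().fit(X_train, y_train)

# Near-perfect fit on the training data, noticeably worse on new data:
# the 49 noise features let the model memorize the training noise
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

The gap between the training and test scores is the signature of high variance: the model fits the particular noise in the training set, not just the signal.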
Linear models can also overfit when the included features are highly correlated. From the scikit-learn documentation:
"...coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed response, producing a large variance."
Linear models can also overfit if the coefficients are too large.
Question: Why would that be the case?
Answer: Because the larger the absolute value of the coefficient, the more power it has to change the predicted response. Thus, a model with large coefficients tends toward high variance, which can result in overfitting.
Regularization is a method for "constraining" or "regularizing" the size of the coefficients, thus "shrinking" them towards zero. It tends to reduce variance more than it increases bias, and thus minimizes overfitting.
Common regularization techniques for linear models:
Ridge regression (L2 regularization) penalizes the sum of the squared coefficients, shrinking all of them towards zero.
Lasso regression (L1 regularization) penalizes the sum of the absolute values of the coefficients, which can shrink some coefficients all the way to zero.
Lasso regularization is useful if we believe many features are irrelevant, since a feature with a zero coefficient is essentially removed from the model. Thus, it is a useful technique for feature selection.
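Here's a sketch of that behavior on synthetic data (the alpha value is an arbitrary choice, not a recommendation): only the first three of thirty features actually influence the response, and Lasso drives most of the irrelevant coefficients exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# 30 features, but only the first 3 actually influence y
n, p = 200, 30
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)

# Most irrelevant features get a coefficient of exactly zero,
# effectively removing them from the model
kept = np.flatnonzero(lasso.coef_)
print("features with nonzero coefficients:", kept)
print("number zeroed out:", np.sum(lasso.coef_ == 0))
```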
How does regularization work?
Our goal is to locate the optimum model complexity; regularization is useful when we believe our model is too complex, because shrinking the coefficients effectively reduces that complexity.
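Since the "right" amount of regularization is rarely known upfront, one common approach (sketched here with scikit-learn's RidgeCV and an arbitrary grid of alphas) is to pick the regularization strength by cross-validation:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)

n, p = 100, 20
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n)

# Try a range of regularization strengths and keep the one
# with the best cross-validated score
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("chosen alpha:", model.alpha_)
```

A larger chosen alpha means the data favored a simpler (more heavily constrained) model.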
It's usually recommended to standardize your features when using regularization.
Question: Why would that be the case?
Answer: If you don't standardize, features would be penalized simply because of their scale: a feature measured in small units needs a large coefficient to have the same effect, and that large coefficient gets penalized more heavily. Also, standardizing (which centers the features) means the intercept doesn't need to be penalized (penalizing the intercept wouldn't make intuitive sense).
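A sketch of the scale problem (synthetic data; the factor of 1,000 is arbitrary): two features carry equal signal, but one is measured in much smaller units, so it needs a large coefficient and gets over-penalized unless we standardize first:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)

n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + 0.1 * rng.normal(size=n)

# Same signal in both columns, but the second is in tiny units,
# so the coefficient it needs is 1000, not 1
X = np.column_stack([x1, x2 / 1000])

raw = Ridge(alpha=1.0).fit(X, y)
scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

# Without scaling, the penalty crushes the second coefficient
# and the fit largely ignores that feature
print("raw fit R^2:   ", raw.score(X, y))
print("scaled fit R^2:", scaled.score(X, y))
```

Using a Pipeline also ensures the scaler is fit only on the training data when you later cross-validate, rather than leaking information from the test folds.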