Regularization
When we train a regression model on a dataset, there is a good chance that the model will overfit the given training set. Regularization helps solve this overfitting problem by restricting the degrees of freedom of the given equation, i.e., by shrinking the weights of the higher-degree terms of a polynomial function toward zero, which effectively reduces its degree.
In a linear equation, we do not want huge weights/coefficients, because a small change in such a weight can make a large difference to the dependent variable (Y). So, regularization constrains the weights of such features to avoid overfitting.
To get the best-fit line, we try to minimize the cost function, given as:

$$J(\theta) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2$$

where $\hat{y}_i$ is the model's prediction for the $i$-th training example. To regularize the model, a shrinkage penalty is added to this cost function.
Types of Regularization
LASSO (Least Absolute Shrinkage and Selection Operator) Regression (L1 Form)
LASSO regression penalizes the model based on the sum of the magnitudes (absolute values) of the coefficients. The regularization term is given by:

$$\text{regularization} = \lambda \sum_{j=1}^{n} |\theta_j|$$

where λ is the shrinkage factor. Hence, the formula for the loss after regularization is:

$$J(\theta) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$$
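As a concrete illustration, here is a minimal sketch of L1-regularized regression using scikit-learn's Lasso. The synthetic data and the choice of alpha (scikit-learn's name for the shrinkage factor λ) are assumptions made for demonstration:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 100 samples, 10 features, only 3 of which matter
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# alpha plays the role of the shrinkage factor lambda in the loss above
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)  # weights of the irrelevant features are driven exactly to zero
```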
Ridge Regression (L2 Form)
Ridge regression penalizes the model based on the sum of the squares of the coefficients. The regularization term is given by:

$$\text{regularization} = \lambda \sum_{j=1}^{n} \theta_j^2$$

where λ is the shrinkage factor. Hence, the formula for the loss after regularization is:

$$J(\theta) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$
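A corresponding minimal Ridge sketch in scikit-learn, under the same illustrative data assumptions as the Lasso sketch above:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# alpha plays the role of the shrinkage factor lambda in the L2 loss above
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print(ridge.coef_)  # coefficients are shrunk toward zero but remain non-zero
```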
This value of λ is not fixed in advance; it should be selected by cross-validation, depending on what suits the model best.
Difference between Ridge and Lasso
Ridge regression shrinks the coefficients of those predictors which contribute very little to the model but have huge weights, pulling them very close to zero. But it never makes them exactly zero. Thus, the final model will still contain all those predictors, though with smaller weights, which doesn't help in interpreting the model very well. This is where Lasso regression differs from Ridge regression. In Lasso, the L1 penalty does reduce some coefficients exactly to zero when we use a sufficiently large tuning parameter λ. So, in addition to regularizing, Lasso also performs feature selection.
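A small sketch makes this difference visible. Fitting both models on the same synthetic data (the data and the alpha values are illustrative assumptions) shows Lasso zeroing out the irrelevant predictors while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Only the first two features truly drive the target
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge:", np.round(ridge.coef_, 3))  # small but non-zero weights everywhere
print("Lasso:", np.round(lasso.coef_, 3))  # irrelevant features are exactly 0.0
```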
Why use Regularization?
Regularization helps to reduce the variance of the model without a substantial increase in the bias. High variance means that the model won't fit well on a dataset different from the training data. The tuning parameter λ controls this bias-variance tradeoff. As the value of λ is increased up to a certain limit, it reduces the variance without losing any important properties of the data. But beyond that limit, the model starts losing important properties, which increases the bias. Thus, selecting a good value of λ is key. The value of λ is chosen using cross-validation: a set of candidate λ values is selected, the cross-validation error is calculated for each, and the λ with the minimum cross-validation error is picked.
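In scikit-learn this search can be done with cross-validated estimators such as LassoCV. The data and the candidate alpha grid below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=150)

# Try a grid of lambda (alpha) values; 5-fold cross-validation picks
# the one with the lowest average validation error
model = LassoCV(alphas=np.logspace(-3, 1, 50), cv=5)
model.fit(X, y)
print("best lambda:", model.alpha_)
```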
Elastic Net
According to the Hands-on Machine Learning book, Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge's and Lasso's regularization terms, and you can control the mix ratio α:

$$\text{regularization} = \lambda \left( \alpha \sum_{j=1}^{n} |\theta_j| + (1 - \alpha) \sum_{j=1}^{n} \theta_j^2 \right)$$

where α is the mixing parameter between Ridge (α = 0) and Lasso (α = 1).
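A minimal Elastic Net sketch in scikit-learn. Note that scikit-learn calls the mix ratio l1_ratio (its alpha parameter is the overall strength λ, not the mixing parameter α used above); the data and parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.2, size=100)

# l1_ratio=0.5 mixes the L1 and L2 penalties equally;
# l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is pure Ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)
```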
When should you use plain Linear Regression (i.e., without any regularization), Ridge, Lasso, or Elastic Net?
According to the Hands-on Machine Learning book, it is
almost always preferable to have at least a little bit of regularization, so
generally, you should avoid plain Linear Regression. Ridge is a good default,
but if you suspect that only a few features are actually useful, you should
prefer Lasso or Elastic Net since they tend to reduce the useless features’
weights down to zero as we have discussed. In general, Elastic Net is preferred
over Lasso since Lasso may behave erratically when the number of features is
greater than the number of training instances or when several features are
strongly correlated.