ISLR_ch6.2 Shrinkage Methods
Shrinkage Methods vs. Subset Selection:
- The subset selection methods described previously use least squares to fit a linear model that contains a subset of the predictors.
- Shrinkage methods instead fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, shrinks the coefficient estimates towards zero.
Ridge Regression
Recall least squares:
\begin{align} RSS=\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2 \end{align}
Ridge regression coefficient estimates $\hat{\beta}_\lambda^R$ are the values that minimize
\begin{align} \sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2+\lambda\sum_{j=1}^p\beta_j^2=RSS+\lambda\sum_{j=1}^p\beta_j^2 \end{align}
Trade-off:
- Ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.
- The shrinkage penalty $\lambda\sum_{j=1}^p\beta_j^2$ is small when $\beta_1,\dots,\beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero.
- The tuning parameter $\lambda\geq 0$ controls the relative impact of these two terms: when $\lambda=0$ the penalty has no effect and ridge regression reproduces the least squares estimates, and as $\lambda\to\infty$ the coefficient estimates approach zero. (The penalty is not applied to the intercept $\beta_0$.)
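As a rough illustration of this trade-off, the sketch below fits ridge regression over a grid of λ values on synthetic data. This is a minimal sketch and an assumption on my part: it uses scikit-learn, whose `Ridge` parameter `alpha` plays the role of λ; none of the code comes from the text itself.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 1.5, 0.0, 0.0])
y = X @ true_beta + rng.normal(size=n)

# As alpha (playing the role of lambda) grows, the coefficient
# estimates shrink towards zero; alpha near 0 is close to least squares.
for alpha in [0.01, 10.0, 100.0, 1000.0]:
    fit = Ridge(alpha=alpha).fit(X, y)
    print(f"lambda={alpha:>7}: {np.round(fit.coef_, 3)}")
```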
Standardization:
Scale equivariance: the standard least squares coefficient estimates are scale equivariant: multiplying $X_j$ by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of 1/c.
In contrast, the ridge regression coefficient estimate $\hat{\beta}_{j,\lambda}^R$ will depend not only on the value of λ, but also on the scaling of the jth predictor, and even on the scaling of the other predictors. It is therefore best to apply ridge regression after standardizing the predictors: \begin{align} \tilde{x}_{ij}=\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}} \end{align} The denominator is the estimated standard deviation of the jth predictor, so all standardized predictors have a standard deviation of one.
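A minimal sketch of this standardization step, assuming scikit-learn: `StandardScaler` divides each predictor by its estimated standard deviation (the same 1/n formula as above) and also centers it, which only affects the intercept.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 100
# Two predictors measured on very different scales.
x1 = rng.normal(size=n)
x2 = 1000.0 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.002 * x2 + rng.normal(size=n)

# Without standardization the penalty depends on the arbitrary scaling
# of each predictor; standardizing first removes that dependence.
raw = Ridge(alpha=10.0).fit(X, y)
scaled = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X, y)
print("raw coefficients:         ", np.round(raw.coef_, 4))
print("standardized coefficients:", np.round(scaled.named_steps["ridge"].coef_, 4))
```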
Ridge Regression Improves Over Least Squares
Bias-variance trade-off:
- Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
- At the least squares coefficient estimates, which correspond to ridge regression with λ = 0, the variance is high but there is no bias. But as λ increases, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in the variance of the predictions, at the expense of a slight increase in bias.
Ridge regression therefore works best in situations where the least squares estimates have high variance, for example when the number of predictors p is almost as large as the number of observations n.
Ridge regression also has substantial computational advantages over best subset selection, which requires searching through $2^p$ models: for any fixed value of λ, ridge regression fits only a single model, and the fits for all values of λ can be computed quite quickly.
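A rough simulation sketch of the bias-variance point above. Assumptions on my part: scikit-learn, a synthetic setting with p close to n so that least squares has high variance, and a very small `alpha` standing in for least squares.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n, p = 50, 47                      # p close to n: least squares is high-variance
beta = rng.normal(size=p)
X_test = rng.normal(size=(2000, p))
y_test = X_test @ beta             # noiseless test targets: measures fit to f(x)

alphas = [1e-4, 1.0, 10.0, 1000.0]
test_mse = {a: [] for a in alphas}
for _ in range(50):                # average over many simulated training sets
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(scale=2.0, size=n)
    for a in alphas:
        pred = Ridge(alpha=a).fit(X, y).predict(X_test)
        test_mse[a].append(np.mean((pred - y_test) ** 2))

# With p close to n, the near-least-squares fit (tiny alpha) suffers from
# high variance; a moderate alpha typically gives a lower average test MSE,
# while a very large alpha over-shrinks and bias dominates.
for a in alphas:
    print(f"lambda={a:>8}: average test MSE = {np.mean(test_mse[a]):.2f}")
```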
The Lasso
The lasso coefficient estimates, $\hat{\beta}_\lambda^L$, minimize the quantity \begin{align} \sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2+\lambda\sum_{j=1}^p|\beta_j|=RSS+\lambda\sum_{j=1}^p|\beta_j| \end{align} The lasso thus replaces ridge regression's $\ell_2$ penalty $\lambda\sum_{j=1}^p\beta_j^2$ with an $\ell_1$ penalty $\lambda\sum_{j=1}^p|\beta_j|$.
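A minimal sketch of fitting the lasso over a few λ values, assuming scikit-learn. Note that scikit-learn's `Lasso` scales the RSS term by 1/(2n), so its `alpha` is proportional to, but not numerically identical to, the λ above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 1.5, 0.0, 0.0])
y = X @ true_beta + rng.normal(size=n)

# The l1 penalty shrinks the coefficients and, unlike the l2 penalty,
# sets some of them exactly to zero once alpha is large enough.
for alpha in [0.01, 0.1, 1.0]:
    fit = Lasso(alpha=alpha).fit(X, y)
    print(f"lambda={alpha:>5}: {np.round(fit.coef_, 3)}")
```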
Another Formulation for Ridge Regression and the Lasso
The lasso and ridge regression coefficient estimates solve the problems
\begin{align}
\underset{\beta}{\text{minimize}} \left\{\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2\right\}\,\, \text{subject to} \,\, \sum_{j=1}^p|\beta_j|\leq s \\
\underset{\beta}{\text{minimize}} \left\{\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2\right\}\,\, \text{subject to} \,\, \sum_{j=1}^p\beta_j^2\leq s
\end{align}
When we perform the lasso we are trying to find the set of coefficient estimates that lead to the smallest RSS, subject to the constraint that there is a budget s for how large $\sum_{j=1}^p|\beta_j|$ can be. When s is extremely large, this budget is not very restrictive, and so the coefficient estimates can be large; in fact, if s is large enough that the least squares solution falls within the budget, the constrained problem simply yields the least squares solution. In contrast, when s is small, the coefficient estimates are forced to be small. An analogous statement holds for ridge regression.
A close connection between the lasso, ridge regression, and best subset selection:
Best subset selection is equivalent to: \begin{align} \underset{\beta}{\text{minimize}} \left\{\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2\right\}\,\, \text{subject to} \,\, \sum_{j=1}^pI(\beta_j\neq 0)\leq s \end{align} where $I(\beta_j\neq 0)$ is an indicator variable that equals one when $\beta_j\neq 0$ and zero otherwise.
Therefore, we can interpret ridge regression and the lasso as computationally feasible alternatives to best subset selection.
The Variable Selection Property of the Lasso
Geometrically, the lasso and ridge regression coefficient estimates are given by the first point at which an ellipse of constant RSS, centered at the least squares estimate, contacts the constraint region.
ridge regression: the constraint is a circle with no sharp points, so this intersection will not generally occur on an axis, and the ridge regression coefficient estimates will typically be exclusively non-zero.
the lasso: the constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis; when this occurs, one of the coefficient estimates equals zero.
- The $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large.
- Hence, much like best subset selection, the lasso performs variable selection.
The lasso therefore yields sparse models, that is, models that involve only a subset of the variables.
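A sketch of this sparsity along a grid of λ values, assuming scikit-learn's `lasso_path` (again, its `alpha` grid corresponds to λ up to scikit-learn's 1/(2n) scaling of the RSS term).

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [4.0, -3.0, 2.0]          # only 3 predictors truly matter
y = X @ true_beta + rng.normal(size=n)

# lasso_path returns coefficients over a decreasing grid of alphas;
# larger alpha (lambda) means fewer non-zero coefficients.
alphas, coefs, _ = lasso_path(X, y, n_alphas=10)
for a, c in zip(alphas, coefs.T):
    print(f"lambda={a:7.3f}  non-zero coefficients: {int(np.sum(c != 0))}")
```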
Comparing the Lasso and Ridge Regression
SAME: ridge regression and the lasso can both yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions.
DIFFERENCES:
- Unlike ridge regression, the lasso performs variable selection, and hence results in models that are easier to interpret.
- In a setting where all of the predictors are related to the response, ridge regression tends to outperform the lasso in terms of prediction error.
Suitable settings:
- Lasso: performs better in a setting where a relatively small number of predictors have substantial coefficients and the remaining predictors have coefficients that are very small or equal to zero.
- Ridge regression: performs better when the response is a function of many predictors, all with coefficients of roughly equal size.
- The number of predictors related to the response is never known a priori for real data sets; cross-validation can be used to determine which approach is better on a particular data set, as in the sketch below.
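A minimal sketch of that cross-validation comparison on synthetic data. Assumptions on my part: scikit-learn's `RidgeCV`/`LassoCV`, predictors standardized first, and a sparse true model, which is the setting that tends to favour the lasso.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n, p = 200, 30
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:5] = [3.0, -2.5, 2.0, -1.5, 1.0]   # sparse truth
y = X @ true_beta + rng.normal(size=n)

# Each model chooses its own lambda internally via cross-validation;
# an outer cross-validation then compares the two approaches.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 50)))
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))

for name, model in [("ridge", ridge), ("lasso", lasso)]:
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: cross-validated test MSE = {mse:.3f}")
```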