ISLR_ch6.2 Shrinkage Methods

Shrinkage Methods vs. Subset Selection:

  • The subset selection methods described previously use least squares to fit a linear model that contains only a subset of the predictors.
  • Shrinkage methods instead fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.

Ridge Regression

Recall least squares:

\begin{align} RSS=\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2 \end{align}

Ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize

\begin{align} \sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2+\lambda\sum_{j=1}^p\beta_j^2=RSS+\lambda\sum_{j=1}^p\beta_j^2 \end{align}

Trade-off:

  1. Ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.
  2. The shrinkage penalty $\lambda\sum_{j=1}^p\beta_j^2$ is small when $\beta_1,\dots,\beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero (a numerical sketch follows this list).
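
The ridge criterion above has a closed-form minimizer on centered predictors, $\hat{\beta}^R=(X^TX+\lambda I)^{-1}X^Ty$, which makes the trade-off easy to inspect numerically. Below is a minimal NumPy sketch on simulated data (not code from ISLR); the `ridge_fit` helper, the simulated design, and the λ grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.5, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

def ridge_fit(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2) via the closed form, on centered data
    (the intercept is not penalized)."""
    Xc = X - X.mean(axis=0)        # center the predictors
    yc = y - y.mean()              # center the response
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta

for lam in [0.0, 1.0, 10.0, 100.0]:
    _, beta = ridge_fit(X, y, lam)
    print(f"lambda={lam:6.1f}  beta={np.round(beta, 3)}")   # estimates shrink toward zero
```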

Standardization:

  • scale equivariant: the standard least squares coefficient estimates are scale equivariant: multiplying $X_j$ by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of 1/c.

  • The ridge coefficient estimates $\hat{\beta}_{j,\lambda}^R$ will depend not only on the value of λ, but also on the scaling of the jth predictor, and even on the scaling of the other predictors. It is best to apply ridge regression after standardizing the predictors \begin{align} \tilde{x}_{ij}=\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}} \end{align} The denominator is the estimated standard deviation of the jth predictor.
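
As a quick sketch of this standardization (assuming the predictors sit in a NumPy array; the simulated columns below are illustrative), it is just a column-wise division by the 1/n standard deviation:

```python
import numpy as np

rng = np.random.default_rng(1)
# Three predictors on deliberately different scales
X = rng.normal(loc=0.0, scale=[1.0, 10.0, 0.1], size=(50, 3))

sd = X.std(axis=0)           # sqrt((1/n) * sum_i (x_ij - xbar_j)^2); NumPy's default ddof=0
X_tilde = X / sd             # the standardized predictors from the formula above
print(X_tilde.std(axis=0))   # every column now has standard deviation 1
```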

Ridge Regression Improves Over Least Squares

  1. bias-variance trade-off

    • Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
    • At the least squares coefficient estimates, which correspond to ridge regression with λ = 0, the variance is high but there is no bias. But as λ increases, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in the variance of the predictions, at the expense of a slight increase in bias.

    ridge regression works best in situations where the least squares estimates have high variance

  2. computational advantages over best subset selection: for any fixed value of λ, ridge regression fits only a single model, whereas best subset selection must search through $2^p$ models.

The Lasso

The lasso coefficients, $\hat{\beta}_\lambda^L$, minimize the quantity \begin{align} \sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2+\lambda\sum_{j=1}^p|\beta_j|=RSS+\lambda\sum_{j=1}^p|\beta_j| \end{align}
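
As a hedged sketch of fitting the lasso in practice, the snippet below uses scikit-learn's coordinate-descent `Lasso` on simulated data; the design, the true coefficients, and the chosen penalty are illustrative assumptions, and note that scikit-learn scales the RSS term differently from the λ formulation above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]    # only the first three predictors matter
y = X @ beta_true + rng.normal(size=n)

# scikit-learn's Lasso minimizes (1/(2n)) * RSS + alpha * sum_j |beta_j|,
# so alpha plays the role of lambda up to the 1/(2n) rescaling.
lasso = Lasso(alpha=0.5).fit(X, y)
print(np.round(lasso.coef_, 3))     # several coefficients are exactly zero
```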

Another Formulation for Ridge Regression and the Lasso

The lasso and ridge regression coefficient estimates solve the problems

\begin{align} \underset{\beta}{\mathrm{minimize}} \left\{\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2\right\}\,\, \mathrm{subject\ to} \,\, \sum_{j=1}^p|\beta_j|\leq s \\
\underset{\beta}{\mathrm{minimize}} \left\{\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2\right\}\,\, \mathrm{subject\ to} \,\, \sum_{j=1}^p\beta_j^2\leq s \end{align}

When we perform the lasso we are trying to find the set of coefficient estimates that lead to the smallest RSS, subject to the constraint that there is a budget s for how large $\sum_{j=1}^p|\beta_j|$ can be. When s is extremely large, this budget is not very restrictive, and so the coefficient estimates can be large.

A close connection between the lasso, ridge regression, and best subset selection:

Best subset selection is equivalent to: \begin{align} \underset{\beta}{\mathrm{minimize}} \left\{\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2\right\}\,\, \mathrm{subject\ to} \,\, \sum_{j=1}^pI(\beta_j\neq 0)\leq s \end{align}

Therefore, we can interpret ridge regression and the lasso as computationally feasible alternatives to best subset selection.
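
To make the computational contrast concrete, here is a toy brute-force version of the constrained problem above (a sketch only; the simulated data and the `best_subset` helper are illustrative, and practical implementations use branch-and-bound or similar tricks). It must refit least squares for every candidate subset, a count that grows combinatorially in p, whereas ridge and the lasso solve a single convex problem per λ.

```python
import itertools
import numpy as np

def best_subset(X, y, s):
    """Exhaustive search over all models with at most s predictors."""
    n, p = X.shape
    best_rss, best_vars = np.inf, ()
    for k in range(1, s + 1):
        for subset in itertools.combinations(range(p), k):
            Xs = np.column_stack([np.ones(n), X[:, list(subset)]])  # intercept + chosen columns
            beta = np.linalg.lstsq(Xs, y, rcond=None)[0]            # least squares fit
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, subset
    return best_vars, best_rss

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=60)
print(best_subset(X, y, s=2))   # should recover predictors 0 and 3
```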

The Variable Selection Property of the Lasso

The lasso and ridge regression coefficient estimates are given by the first point at which an ellipse of constant RSS contacts the constraint region.

ridge regression: the circular constraint has no sharp points, so the intersection will not generally occur on an axis, and the ridge regression coefficient estimates will be exclusively non-zero.

the lasso: constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis.

  • the $l_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large.
  • Hence, much like best subset selection, the lasso performs variable selection

The lasso yields sparse models, that is, models that involve only a subset of the variables.
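
A small illustration of this variable selection property, using scikit-learn's `lasso_path` on simulated data (the design and the penalty grid are assumptions): the number of non-zero coefficients drops as the penalty grows.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(size=n)

# Coefficient profiles over a decreasing grid of penalties (sklearn's alpha ~ lambda)
alphas, coefs, _ = lasso_path(X, y, n_alphas=10)
for a, c in zip(alphas, coefs.T):           # coefs has shape (n_features, n_alphas)
    print(f"alpha={a:8.4f}  nonzero coefficients: {np.count_nonzero(c)}")
```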

Comparing the Lasso and Ridge Regression

SAME: Both ridge regression and the lasso can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions.

DIFFERENCES:

  • Unlike ridge regression, the lasso performs variable selection, and hence results in models that are easier to interpret.
  • Neither method universally dominates: ridge regression outperforms the lasso in terms of prediction error when all of the predictors are related to the response, while the lasso does better when many of the true coefficients are zero (see the settings below).

Suitable setting:

  • Lasso: performs better in a setting where a relatively small number of predictors have substantial coefficients and the remaining predictors have coefficients that are very small or equal to zero.
  • Ridge regression: performs better when the response is a function of many predictors, all with coefficients of roughly equal size.
  • The number of predictors related to the response is never known a priori for real data sets. Cross-validation can be used to determine which approach is better on a particular data set; a sketch follows this list.
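
A minimal scikit-learn sketch of such a cross-validated comparison on simulated, sparse-truth data (the models, penalty grids, and fold count are illustrative assumptions, not a prescription):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:4] = [3.0, -2.0, 1.5, 1.0]    # sparse truth: a setting that tends to favor the lasso
y = X @ beta_true + rng.normal(size=n)

# RidgeCV/LassoCV pick their own penalty by internal cross-validation; note their
# alpha grids are not on the same scale (Lasso's objective rescales the RSS by 1/(2n)).
alphas = np.logspace(-3, 3, 50)
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
lasso = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, max_iter=10000))

for name, model in [("ridge", ridge), ("lasso", lasso)]:
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"{name}: 5-fold CV MSE = {mse:.3f}")
```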
