ISLR_ch3.5 Comparison_of_Linear_Regression_with_K-Nearest_Neighbors

Parametric vs. Non-parametric

Linear regression is an example of a parametric approach because it assumes a linear functional form for f(X).

Parametric methods

  • Advantages:
    • Easy to fit, because one needs to estimate only a small number of coefficients.
    • Simple interpretations, and tests of statistical significance are easy to perform.
  • Disadvantage:
    • Strong assumptions about the form of f(X). If the specified functional form is far from the truth, and prediction accuracy is our goal, then the parametric method will perform poorly.

Non-parametric methods

  • Do not explicitly assume a parametric form for f(X), and thereby provide an alternative and more flexible approach for performing regression.
  • K-nearest neighbors regression (KNN regression)

KNN Regression

Given a value for $K$ and a prediction point $x_0$, KNN regression first identifies the $K$ training observations that are closest to $x_0$, represented by $N_0$. It then estimates $f(x_0)$ using the average of all the training responses in $N_0$.

\begin{align} \hat{f}(x_0)=\frac{1}{K}\sum_{x_i\in N_0}y_i \end{align}
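
The estimator above can be sketched in a few lines of NumPy (a minimal illustration with made-up data, not ISLR's lab code; Euclidean distance is assumed, as is standard for KNN):

```python
import numpy as np

def knn_regress(x_train, y_train, x0, k):
    """Predict f(x0) as the mean response of the k nearest training points."""
    # Euclidean distance from x0 to every training observation
    dists = np.linalg.norm(x_train - x0, axis=1)
    # Indices of the K closest points: the neighborhood N_0
    n0 = np.argsort(dists)[:k]
    return y_train[n0].mean()

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=(50, 1))
y_train = np.sin(x_train[:, 0]) + rng.normal(0, 0.1, 50)
print(knn_regress(x_train, y_train, np.array([5.0]), k=3))
```

Note that with $K=1$ a training point is always its own nearest neighbor, so the fit passes through every training response exactly.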

  • The optimal value for K will depend on the bias-variance trade-off.
    • A small value of K provides the most flexible fit, which will have low bias but high variance. The variance arises because the prediction in a given region depends entirely on a single observation (when K = 1) or a small handful of observations.
    • Larger values of K provide a smoother and less variable fit: the prediction in a region is an average of several points, so changing one observation has a smaller effect. However, the smoothing may cause bias by masking some of the structure in f(X).
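
The two bullets above can be seen directly by fitting KNN to its own training set (a hypothetical 1-D example, not from ISLR): with K = 1 the fit interpolates the data exactly, while K = 9 averages over neighbors and smooths.

```python
import numpy as np

def knn_fit(x_tr, y_tr, x_eval, k):
    """1-D KNN regression: average the k nearest training responses."""
    return np.array([y_tr[np.argsort(np.abs(x_tr - x0))[:k]].mean()
                     for x0 in x_eval])

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 5, 40))
y = x + rng.normal(0, 0.5, 40)  # true f is linear, plus noise

# K = 1: each point is its own neighborhood, so training MSE is exactly 0
mse_k1 = np.mean((knn_fit(x, y, x, 1) - y) ** 2)
# K = 9: averaging smooths the fit, so training MSE is positive
mse_k9 = np.mean((knn_fit(x, y, x, 9) - y) ** 2)
print(mse_k1, mse_k9)
```

Low training error at K = 1 is precisely the high-variance end of the trade-off: the fit chases the noise, which is what hurts on test data.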

The parametric approach will outperform the nonparametric approach if the parametric form that has been selected is close to the true form of f.

  • A non-parametric approach incurs a cost in variance that is not offset by a reduction in bias.
  • KNN performs slightly worse than linear regression when the relationship is linear, but much better than linear regression for non-linear situations.
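
This comparison can be reproduced with a small simulation (a sketch with synthetic data; `compare_mse`, the sample sizes, and K = 5 are arbitrary choices, not ISLR's setup). Least squares is fit with `np.polyfit`, and both methods are scored on a held-out test set under a linear and a non-linear truth:

```python
import numpy as np

def knn_predict(x_tr, y_tr, x_te, k):
    """1-D KNN regression predictions for each test point."""
    return np.array([y_tr[np.argsort(np.abs(x_tr - x0))[:k]].mean()
                     for x0 in x_te])

def compare_mse(f, noise=0.3, k=5, seed=2):
    """Test-set MSE of least squares vs. KNN when the true f is given."""
    rng = np.random.default_rng(seed)
    x_tr = rng.uniform(-2, 2, 100); y_tr = f(x_tr) + rng.normal(0, noise, 100)
    x_te = rng.uniform(-2, 2, 200); y_te = f(x_te) + rng.normal(0, noise, 200)
    b1, b0 = np.polyfit(x_tr, y_tr, 1)          # slope, intercept
    mse_lin = np.mean((b0 + b1 * x_te - y_te) ** 2)
    mse_knn = np.mean((knn_predict(x_tr, y_tr, x_te, k) - y_te) ** 2)
    return mse_lin, mse_knn

print(compare_mse(lambda x: 1 + 2 * x))      # linear truth: OLS typically edges out KNN
print(compare_mse(lambda x: np.sin(2 * x)))  # non-linear truth: KNN wins clearly
```

The gap is small in the linear case (KNN pays only the variance cost of averaging) but large in the non-linear case, where the misspecified linear form carries irreducible bias.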

The increase in dimension has only caused a small deterioration in the linear regression test set MSE, but it has caused more than a ten-fold increase in the MSE for KNN.

  • This decrease in performance as the dimension increases is a common problem for KNN, and results from the fact that in higher dimensions there is effectively a reduction in sample size. $\Rightarrow$ curse of dimensionality
  • As a general rule, parametric methods will tend to outperform non-parametric approaches when there is a small number of observations per predictor.
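
A sketch of this effect (hypothetical setup, not ISLR's simulation): the response depends only on the first predictor, and extra pure-noise dimensions are added. KNN test MSE degrades as $p$ grows, because the "nearest" neighbors in the full space are no longer close in the one dimension that matters.

```python
import numpy as np

def knn_predict(x_tr, y_tr, x_te, k=5):
    """Multivariate KNN regression with Euclidean distance."""
    return np.array([y_tr[np.argsort(np.linalg.norm(x_tr - x0, axis=1))[:k]].mean()
                     for x0 in x_te])

rng = np.random.default_rng(3)
n_tr, n_te = 100, 200
mses = {}
for p in (1, 5, 20):
    # only the first predictor matters; the remaining p-1 are pure noise
    x_tr = rng.uniform(-1, 1, (n_tr, p)); x_te = rng.uniform(-1, 1, (n_te, p))
    y_tr = x_tr[:, 0] ** 2 + rng.normal(0, 0.1, n_tr)
    y_te = x_te[:, 0] ** 2 + rng.normal(0, 0.1, n_te)
    mses[p] = np.mean((knn_predict(x_tr, y_tr, x_te) - y_te) ** 2)
    print(p, round(mses[p], 3))
```

The same 100 training points that densely cover one dimension are hopelessly sparse in twenty, which is the "effective reduction in sample size" noted above.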
