Penalized Likelihood

When a large number of variables/predictors are available for predicting the response, the conventional MLE/least squares estimates tend to overfit, producing high-variance estimates.

Used when we want to penalize model complexity to encourage a simpler model:

$$ l_p(\theta; x) = l(\theta; x) - \lambda P(\theta) $$
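As a minimal sketch (not from the notes), this is what the penalized log-likelihood might look like for a Gaussian linear model in Python; the function name `penalized_loglik` and the two penalty options are illustrative assumptions:

```python
import numpy as np

def penalized_loglik(beta, X, y, lam, penalty="ridge", sigma2=1.0):
    # l_p(beta) = l(beta) - lam * P(beta) for a Gaussian linear model.
    # The name and penalty options here are illustrative assumptions.
    resid = y - X @ beta
    n = len(y)
    loglik = -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2
    if penalty == "ridge":
        pen = beta @ beta            # P(beta) = sum of beta_j^2
    elif penalty == "lasso":
        pen = np.abs(beta).sum()     # P(beta) = sum of |beta_j|
    else:
        raise ValueError("unknown penalty")
    return loglik - lam * pen
```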

Ridge Regression

$$ Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i $$

$$ \hat\beta = (X^T X)^{-1} X^T Y $$

$$ \operatorname{var}(\hat\beta) = \operatorname{var}\!\left((X^T X)^{-1} X^T Y\right) = (X^T X)^{-1} X^T \operatorname{var}(Y)\, X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}, \qquad \operatorname{var}(Y) = \sigma^2 I $$

$$ \operatorname{var}(\hat y) = \operatorname{var}(X \hat\beta) = \operatorname{var}\!\left(X (X^T X)^{-1} X^T Y\right) = X \operatorname{var}(\hat\beta)\, X^T = \sigma^2\, X (X^T X)^{-1} X^T $$
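A quick numpy check of these closed-form expressions, assuming simulated data (the dimensions, coefficients, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 0.5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# OLS closed form: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Sampling covariance: Var(beta_hat) = sigma^2 (X^T X)^{-1}
cov_beta_hat = sigma**2 * np.linalg.inv(X.T @ X)

print("beta_hat:  ", beta_hat)
print("std errors:", np.sqrt(np.diag(cov_beta_hat)))
```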

The ordinary least squares estimate $\hat\beta^{\text{OLS}}$ of $\beta$ is the solution that minimizes the residual sum of squares:

$$ \hat\beta^{\text{OLS}} = \arg\min_\beta \left\{ \sum_{i=1}^{n} \epsilon_i^2 \right\} $$

$$ \hat\beta^{\text{ridge}} = \arg\min_\beta \left\{ \sum_{i=1}^{n} \left(y_i - x_i^T \beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} = (X^T X + \lambda I)^{-1} X^T y $$
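A minimal sketch of the ridge closed form (the helper name `ridge_closed_form` is an assumption, not standard API):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    # beta_ridge = (X^T X + lam * I)^{-1} X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

Setting `lam = 0` recovers the OLS solution; larger values shrink the coefficients toward zero.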

By minimizing the error term plus the squared betas, we reduce the model's reliance on any specific beta to explain y (overfitting). Geometrically, the ridge penalty constrains the coefficient vector to a circle in two dimensions, or a sphere/hypersphere in higher dimensions; see the constrained form below.
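Equivalently (a standard reformulation, not spelled out in the original notes), the penalized problem can be written as a constrained one, which makes the circle/sphere picture explicit:

$$ \hat\beta^{\text{ridge}} = \arg\min_\beta \sum_{i=1}^{n} \left(y_i - x_i^T \beta\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t $$

for some $t \ge 0$ that corresponds one-to-one with $\lambda$.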

$$ \frac{\partial\, l_p(\beta)}{\partial \beta_j} \bigg|_{\beta = \hat\beta} = 0, \qquad \lambda \in \{\lambda_{\min}, \dots, \lambda_{\max}\}, \qquad \lambda_{\max} = \min \lambda \ \text{ s.t. all } \hat\beta_j = 0, \ j \neq 0 $$
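A short sketch of the coefficient path over a $\lambda$ grid (the grid and simulated data are arbitrary assumptions). Note that ridge coefficients shrink toward zero but reach exactly zero only in the limit; the $\lambda_{\max}$ rule of all coefficients being exactly zero holds for the LASSO path rather than ridge.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Ridge path: re-solve (X^T X + lam I) beta = X^T y over a lambda grid.
# The squared norm of beta shrinks monotonically as lambda grows.
for lam in np.logspace(-2, 4, 7):
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(f"lambda = {lam:10.2f}   ||beta||^2 = {beta @ beta:.4f}")
```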

Exercise: show that $\operatorname{Var}(\hat\beta^{\text{ridge}}) \preceq \operatorname{Var}(\hat\beta^{\text{OLS}})$, i.e. the ridge estimator has smaller variance than the unbiased OLS estimator (at the cost of introducing bias).
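A numeric check of this claim, using the standard sandwich form $\operatorname{Var}(\hat\beta^{\text{ridge}}) = \sigma^2 (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1}$ and arbitrary simulated X:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
sigma2, lam = 0.25, 5.0
XtX = X.T @ X
I = np.eye(X.shape[1])

cov_ols = sigma2 * np.linalg.inv(XtX)           # sigma^2 (X^T X)^{-1}
S = np.linalg.inv(XtX + lam * I)
cov_ridge = sigma2 * S @ XtX @ S                # sandwich form

# Var(OLS) - Var(ridge) should be positive semidefinite:
print(np.linalg.eigvalsh(cov_ols - cov_ridge))  # all eigenvalues >= 0
```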

Least Absolute Shrinkage and Selection Operator (LASSO)