# Machine Learning Likelihood, Loss, Gradient, and Hessian Cheat Sheet

Cheat sheet for likelihoods, loss functions, gradients, and Hessians. This is a living document that I’ll update over time.

# Motivating theory

## Bayes theorem

Bayes’ theorem tells us that the posterior probability of a hypothesis $H$ given data $D$ is

\begin{equation} P(H|D) = \frac{P(H) P(D|H)}{P(D)}, \end{equation}

where

- $P(H \vert D)$ is the **posterior** probability of the (variable) hypothesis given the (fixed) observed data
- $P(H)$ is the **prior** probability of the hypothesis
- $P(D \vert H)$ is the **likelihood** $\mathcal{L}$, the probability that the observed data was generated by $H$
- $P(D)$ is the marginal likelihood, usually discarded because it’s not a function of $H$.

In supervised machine learning, models are hypotheses and data are $y_i | \mathbf{x}_i$ label-feature vector tuples.

We’re looking for the best model, which maximizes the posterior probability. If the prior is flat ($P(H) = 1$) this reduces to likelihood maximization.

If the prior on model parameters is normal you get Ridge regression. If the prior on model parameters is Laplace distributed you get LASSO.
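To sketch why, assume a Gaussian prior $\beta \sim \mathcal{N}(0, \sigma_\beta^2 I)$ on the coefficients. The negative log-posterior is then

\begin{equation} -\log{P(H|D)} = -\log{\mathcal{L}} + \frac{1}{2\sigma_\beta^2} \lVert \beta \rVert_2^2 + \mathrm{const}, \end{equation}

which is the Ridge objective with penalty $\lambda = 1/(2\sigma_\beta^2)$. A Laplace prior contributes $\lambda \lVert \beta \rVert_1$ instead, which is the LASSO penalty.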

## Gradient descent

The objective function is derived as the negative of the log-likelihood function, and can also be expressed as the mean of a loss function $\ell$ over data points. Objectives with regularization can be thought of as the negative of the log-posterior probability function, but I’ll be ignoring regularizing priors here.

\[L = -\log{\mathcal{L}} = \frac{1}{N}\sum_i^{N} \ell_i.\]

### In linear regression, gradient descent happens in parameter space

For linear models like least-squares and logistic regression,

\[\ell_i = \ell(f(\beta; \mathbf{x}_i))\]

where

\[f(\beta; \mathbf{x}_i) = \mathbf{x}_i^T \mathbf{\beta},\]

$\beta$ are the coefficients and $\mathbf{x}_i$ is the $i$-th feature vector. This formulation supports a y-intercept or offset term by defining $x_{i,0} = 1$. The rest of the entries $x_{i,j}: j>0$ are the model features.
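As a small illustration of the intercept convention (the numbers here are made up), prepending a column of ones to the feature matrix lets the first coefficient act as the offset:

```python
import numpy as np

# Hypothetical 3-point dataset with two features.
X = np.array([[2.0, 1.0],
              [0.5, 3.0],
              [1.5, 2.0]])

# Prepend the constant column x_{i,0} = 1 so beta[0] acts as the intercept.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

beta = np.array([0.5, 1.0, -2.0])  # [intercept, coef_1, coef_2]
f = X1 @ beta                      # f(beta; x_i) = x_i^T beta for each row
```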

Gradient descent minimization methods make use of the first partial derivative.

\[\begin{equation} \ell^{\prime} = \frac{\partial \ell}{\partial \mathbf{\beta}} = \mathbf{x}_i \frac{\partial \ell}{\partial f} \end{equation}\]

Some gradient descent variants, like Newton-Raphson, use the second partial derivative or *Hessian*.
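A minimal sketch of parameter-space gradient descent, assuming square loss so that $\partial \ell / \partial f = f - y_i$ (the data and learning rate here are illustrative):

```python
import numpy as np

def gradient_descent(X, y, dloss_df, lr=0.1, n_steps=500):
    """Gradient descent for a linear model f = X @ beta.

    dloss_df(f, y) is the per-point derivative of the loss w.r.t. f;
    the chain rule maps it to the parameter gradient X^T (dl/df) / N.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        f = X @ beta
        beta -= lr * X.T @ dloss_df(f, y) / len(y)
    return beta

# Square-loss example (dl/df = f - y) on data that follow y = 1 + 2x exactly.
X = np.hstack([np.ones((4, 1)), np.array([[0.0], [1.0], [2.0], [3.0]])])
y = np.array([1.0, 3.0, 5.0, 7.0])
beta = gradient_descent(X, y, lambda f, y: f - y)
```

The same driver works for any loss once its $f$-space derivative is known, which is the point of tabulating the derivatives below.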

### In gradient boosting, gradient descent happens in function space

In gradient boosting,

\[\begin{equation} \ell_i = \ell(f(\mathbf{x}_i)) \end{equation}\]

where optimization is done over the set of different functions $\{f\}$ in functional space rather than over parameters of a single linear function. In this case the gradient is taken w.r.t. the function $f$.

\[\begin{equation} \ell^{\prime} = \frac{\partial \ell}{\partial f} \end{equation}\]

and the Hessian is

\[\begin{equation} \ell^{\prime\prime} = \frac{\partial^2 \ell}{\partial f^2}. \end{equation}\]

All derivatives below will be computed with respect to $f$. If you are using them in a gradient boosting context, this is all you need. If you are using them in a linear model context, you need to multiply the gradient and Hessian by $\mathbf{x}_i$ and $\mathbf{x}_i \mathbf{x}_i^T$, respectively.
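A small sketch of that conversion, using the standard log-loss derivatives $p - y_i$ and $p(1 - p)$ at a hypothetical data point:

```python
import numpy as np

def to_parameter_space(x_i, dl_df, d2l_df2):
    """Chain rule: per-point f-space derivatives -> parameter-space
    gradient x_i * dl/df and Hessian outer(x_i, x_i) * d2l/df2."""
    return x_i * dl_df, np.outer(x_i, x_i) * d2l_df2

# Hypothetical point: feature vector x_i, predicted p = 0.75, label y = 1.
x_i = np.array([1.0, 2.0])
p, y = 0.75, 1.0
g, H = to_parameter_space(x_i, p - y, p * (1 - p))
```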

# Likelihood, loss, gradient, Hessian

The loss is the negative log-likelihood for a single data point.

## Square loss

Used in continuous variable regression problems.

**Likelihood**

Start by asserting normally distributed errors.

\[\begin{equation} \prod_{i=1}^N\frac{1}{\sigma\sqrt{2\pi}}\exp{\left(-\frac{(y_i - f(\mathbf{x}_i))^2}{2\sigma^2}\right)} \end{equation}\]

**Loss**

Taking the negative logarithm and dropping terms that do not depend on $f$,

\[\begin{equation} \ell = \frac{1}{2}\left(y_i - f(\mathbf{x}_i)\right)^2 \end{equation}\]

**Gradient**

\[\begin{equation} \ell^{\prime} = \frac{\partial \ell}{\partial f} = f(\mathbf{x}_i) - y_i \end{equation}\]

**Hessian**

\[\begin{equation} \ell^{\prime\prime} = \frac{\partial^2 \ell}{\partial f^2} = 1 \end{equation}\]
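A quick numerical sanity check of the square-loss gradient $f - y_i$ via central finite differences (the test point is arbitrary):

```python
def sq_loss(f, y):
    # 0.5 * (y - f)^2, the per-point square loss
    return 0.5 * (y - f) ** 2

f, y, eps = 1.3, 2.0, 1e-6
numeric = (sq_loss(f + eps, y) - sq_loss(f - eps, y)) / (2 * eps)
analytic = f - y  # closed-form dl/df
```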

## Log loss

Used in binary classification problems.

**Likelihood**

Start by asserting binary outcomes are Bernoulli distributed.

\begin{equation} \prod_{i=1}^N p(\mathbf{x}_i)^{y_i} (1 - p(\mathbf{x}_i))^{1 - {y_i}} \end{equation}

The model in this case is a function $f$ with unbounded range, $f(\mathbf{x}_i) \in (-\infty, \infty)$, that maps to the Bernoulli probability parameter $p$ via the log-odds or “logit” link function.

\begin{equation} f(\mathbf{x}_i) = \log{\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}} \end{equation}

This formulation maps the unbounded model output onto probabilities $p \in (0, 1)$ by just solving for $p$:

\begin{equation} p(\mathbf{x}_i) = \frac{1}{1 + \exp{(-f(\mathbf{x}_i))}} \end{equation}

**Loss**

For labels following the binary indicator convention $y \in \{0, 1\}$, all of the following are equivalent. The easiest way to prove they are equivalent is to plug in $y = 0$ and $y = 1$ and rearrange.

\[\begin{equation} \begin{split} \ell & = -y_i\log{p(\mathbf{x}_i)} - (1 - y_i)\log{(1 - p(\mathbf{x}_i))} \\ & = y_i \log{(1 + \exp{(-f(\mathbf{x}_i))})} \\ & \qquad + (1 - y_i) \log{(1 + \exp{(f(\mathbf{x}_i))})} \\ & = - y_i f(\mathbf{x}_i) + \log{(1 + \exp{(f(\mathbf{x}_i))})} \end{split} \end{equation}\]

The first form is useful if you want to use different link functions.

For labels following the transformed convention $z = 2y-1 \in \{-1, 1\}$:

\[\begin{equation} \ell = \log{(1 + \exp{(-z f(\mathbf{x}_i))})} \end{equation}\]

**Gradient**

\[\begin{equation} \ell^{\prime} = \frac{\partial \ell}{\partial f} = p(\mathbf{x}_i) - y_i \end{equation}\]

**Hessian**

\[\begin{equation} \ell^{\prime\prime} = \frac{\partial^2 \ell}{\partial f^2} = p(\mathbf{x}_i)(1 - p(\mathbf{x}_i)) \end{equation}\]
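The identity $\partial \ell / \partial f = p - y_i$ can also be sanity-checked numerically (the test point is arbitrary):

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def log_loss(f, y):
    # -y*f + log(1 + exp(f)), equivalent to -y log p - (1-y) log(1-p)
    return -y * f + np.log1p(np.exp(f))

f, y, eps = 0.8, 1.0, 1e-6
numeric = (log_loss(f + eps, y) - log_loss(f - eps, y)) / (2 * eps)
analytic = sigmoid(f) - y  # p - y
```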

## Quantile regression

**Likelihood**

I have not yet seen somebody write down a motivating likelihood function for quantile regression loss.

**Loss**

Sometimes called the pinball loss.

\[\begin{equation} \begin{split} \ell & = (y_i - f(\mathbf{x}_i)) ( \tau - \mathbb{1}_{y_i < f(\mathbf{x}_i)} ) \\ & = \sum_{y_i \geq f(\mathbf{x}_i)}\tau (y_i - f(\mathbf{x}_i)) \\ & \qquad - \sum_{y_i < f(\mathbf{x}_i)}(1 - \tau) (y_i - f(\mathbf{x}_i)) \end{split} \end{equation}\]

**Gradient**

\[\begin{equation} \ell^{\prime} = \frac{\partial \ell}{\partial f} = \mathbb{1}_{y_i < f(\mathbf{x}_i)} - \tau \end{equation}\]

**Hessian**

The loss is piecewise linear in $f$, so the second derivative is zero everywhere except at $y_i = f(\mathbf{x}_i)$, where it is undefined. Gradient boosting implementations typically substitute a constant (e.g. 1).
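A small numerical illustration (with made-up data) that minimizing the mean pinball loss over a constant prediction recovers a sample quantile:

```python
import numpy as np

def pinball(f, y, tau):
    # (y - f)(tau - 1{y < f}) for a scalar prediction f over the array y
    return (y - f) * (tau - (y < f))

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
tau = 0.9
grid = np.linspace(0.0, 10.0, 1001)
losses = np.array([pinball(f, y, tau).mean() for f in grid])
best = grid[np.argmin(losses)]  # lands at the empirical 0.9-quantile of y
```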

## Mean absolute deviation

Mean absolute deviation is quantile regression at $\tau=0.5$.

## Cox proportional hazards

**Likelihood**

Start from the Cox proportional hazards partial likelihood function. The partial likelihood is, as you might guess, just part of a larger likelihood, but it is sufficient for maximum likelihood estimation and therefore regression.

\[\begin{equation} L(f) = \prod_{i:C_i = 1} \frac{\exp{f_i}}{\sum_{j:t_j \geq t_i} \exp{f_j}} \end{equation}\]

Using the analogy of subscribers to a business who may or may not renew from period to period, the terminology of survival analysis is as follows.

- $i$ and $j$ index users.
- $C_i = 1$ is a cancellation or churn event for user $i$ at time $t_i$
- $C_i = 0$ is a renewal or survival event for user $i$ at time $t_i$
- Subscribers $i:C_i = 1$ are users who canceled at time $t_i$.
- $j:t_j \geq t_i$ are users who have survived up to and including time $t_i$, which is the instant before subscriber $i$ canceled their subscription and churned out of the business. This is called the *risk set*, because they are the users at risk of canceling at the time user $i$ canceled. The risk set includes user $i$.

In clinical studies, users are subjects and churn is non-survival, i.e. death.

**Loss**

The negative log partial likelihood for a single data point is

\[\begin{equation} \ell_i = \delta_i \left[ -f(\mathbf{x}_i) + \log{\sum_{j:t_j \geq t_i} \exp{f(\mathbf{x}_j)}} \right] \end{equation}\]

where $\delta_i$ is the churn/death indicator.

**Gradient**

The efficient algorithm to compute the gradient and Hessian involves ordering the $n$ survival data points, which are indexed by $i$, by time $t_i$. This turns $n^2$ time complexity into $n\log{n}$ for the sort followed by $n$ for the progressive total-loss compute (ref).

For linear regression, the gradient for instance $i$ is

\[\begin{equation} \frac{\partial \ell_i}{\partial \beta} = \delta_i \left[ - \mathbf{x}_i + \frac{\sum_{j:t_j \geq t_i} \mathbf{x}_j \exp{f(\beta; \mathbf{x}_j)}}{\sum_{j:t_j \geq t_i} \exp{f(\beta; \mathbf{x}_j)}} \right] \end{equation}\]

For gradient boosting, the gradient for instance $i$ is

\[\begin{equation} \frac{\partial L}{\partial f(\mathbf{x}_i)} = -\delta_i + \sum_{j=1}^n \delta_j \mathbb{1}_{t_i \geq t_j} \frac{\exp{(f(\mathbf{x}_i))}}{\sum_{r=1}^n \mathbb{1}_{t_r \geq t_j} \exp{(f(\mathbf{x}_r))}} \end{equation}\]

Note that censored instances ($\delta_i = 0$) still receive a nonzero gradient, because they appear in the risk-set denominators of other instances' events.

**Hessian**

To be written.
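The $f$-space gradient can be sanity-checked with a naive $O(n^2)$ implementation against finite differences; this is a sketch on toy data, not the sorted $n\log{n}$ algorithm described above:

```python
import numpy as np

def cox_loss(f, t, delta):
    """Naive O(n^2) negative log partial likelihood.
    f: model scores, t: times, delta: event (churn) indicators."""
    total = 0.0
    for i in range(len(t)):
        if delta[i] == 1:
            risk = t >= t[i]  # risk set for event i
            total += -f[i] + np.log(np.sum(np.exp(f[risk])))
    return total

def cox_grad(f, t, delta):
    """Exact gradient of cox_loss w.r.t. the scores f."""
    grad = -delta.astype(float)
    for j in range(len(t)):
        if delta[j] == 1:
            w = (t >= t[j]) * np.exp(f)  # softmax weights over risk set j
            grad += w / w.sum()
    return grad

# Toy data: times, event indicators, and arbitrary scores.
t = np.array([3.0, 1.0, 4.0, 2.0, 5.0])
delta = np.array([1, 1, 0, 1, 1])
f = np.array([0.2, -0.1, 0.4, 0.0, -0.3])

g = cox_grad(f, t, delta)
eps = 1e-6
g_num = np.zeros_like(f)
for k in range(len(f)):
    fp, fm = f.copy(), f.copy()
    fp[k] += eps
    fm[k] -= eps
    g_num[k] = (cox_loss(fp, t, delta) - cox_loss(fm, t, delta)) / (2 * eps)
```

Each event contributes a softmax over its risk set, so the gradient components sum to zero across all instances.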

# Backlog

- Cross entropy for multiclass problems
- Accelerated failure time
- Hinge
- Huber
- Poisson
- Kullback-Leibler
