A comparison of robust regression methods based on iteratively reweighted least squares and maximum likelihood estimation post-date: 10/16/2016
 Time series data often have heavy-tailed noise. This is a form of heteroskedasticity, and it can be found throughout economics, finance, and engineering. In this post I share a few well known methods in the robust linear regression literature, using a toy dataset as an example throughout. In this particular problem there are five datasets consisting of responses, $$y$$, that are linear in a covariate, $$x$$, according to $y_{i} = a + bx_{i} + \epsilon_{i} \tag{Eq. 1}\label{Eq. 1}$ where $$\epsilon_{i}$$ is an independent noise term that varies in $$x$$, but has mean zero conditioned on $$x$$. The goal is to provide the optimal set of point estimates ($$\hat{a}, \hat{b}$$) given the data. This post explores a few well known approaches to fitting linear models with heterogeneous noise, using our toy problem as an example. Ordinary Least Squares (OLS) Figure 1 plots raw data along with the ordinary least squares (OLS) fits and the residuals, $$\left\lvert \epsilon \right\rvert$$. As a reminder, the analytic solution to minimize the sum of the squared residuals, $$\sum_{i=0}^{n} (y_{i} - a - bx_{i})^{2}$$, is $\hat{\Theta}_{OLS} = (X^{T}X)^{-1}X^{T}Y \\\\\ \\\\\ \mathsf{where} \\\\\ \\\\\ \hat{\Theta}_{OLS} = \begin{bmatrix} \hat{a} \\ \hat{b}. \\ \end{bmatrix} \tag{Eq. 2}\label{Eq. 2}$ Figure 1. Datasets 1-5 along with ordinary least squares fits. The residual plot is shown in the bottom right. There are a total of 350 data points in these 5 datasets. The breakdown is 100-100-50-50- 50. Before we start, a few things are evident from looking at the data. In the first two datasets (data_1_1 and data_1_2), which have the most examples, $$\overline{x} \approx 0$$ and the noise appears to be symmetric about $$\overline{x}$$ and $$\overline{y}$$. We therefore expect that ordinary least squares (OLS) will be reasonably robust to the heavy-tail in these cases. For the last three datasets though, we expect OLS to overfit given the asymmetry in the noise. Iteratively Reweighted Least Squares (IRLS) A naive approach to combat the heavy-tail is to weigh the contribution of each data point based on its residual. This provides a way to reduce the influence of high-variance data points in the fitting process. Implicit in the above statement is that we know the residuals, $$\epsilon_{i}$$, meaning we know $$\overline{y} \vert x$$. However, we do not know this a priori, and therefore must use an iterative process to converge on a solution. This approach is often called iteratively reweighted least squares, or IRLS for short. In IRLS we start by defining a $$n \times n$$ diagonal matrix, $$W$$, of weights. Note that we are now using a non-parametric model, since the number of independent fitting parameters equals $$n^{2}$$ + 2. $$W$$ will be defined by our choice of loss function, $$\ell(\epsilon)$$. $\ell(\epsilon) = W(Y - X\Theta) \tag{Eq. 3}\label{Eq. 3}$ Without assuming anything about the loss function, other than that it is once differentiable in $$\epsilon$$, it's fairly easy to show that, for a given $$W$$, $$\hat{\Theta}_{IRLS}$$ can be found analytically. $\begin{split} \frac{d\ell(\epsilon)}{d\Theta} = 0 &= \frac{d\ell_{H}(Y - X\Theta)}{d\Theta} \\\\\ &= W(Y - X\Theta)X \\\\\ &= X^{T}WY - X^{T}WX\Theta \\\\\ \hat{\Theta}_{IRLS} &= (X^{T}WX)^{-1}X^{T}WY \end{split} \tag{Eq. 4}\label{Eq. 4}$ This is the weighted version of of the OLS solution. Based on our definition for $$W$$, then, the analytic solution for $$\Theta_{IRLS}$$ dictates that $$w_{i} = d\ell(\epsilon_{i})/d \vert \epsilon_{i} \vert$$; $$voil\grave{a}$$, we have an optimal solution for $$\Theta$$ given $$W$$, and vice versa. As for our choice of loss function, we're looking for a rule that minimizes the effect of the heavy-tail. One such rule is the Huber loss (Eq. 5), $$\ell_{H}$$, which is a hybrid L1/L2 loss function that defines a threshold value, $$k$$, for $$\vert \epsilon \vert$$, below which $$\ell_{H}$$ takes the form of the L2 loss, and above which $$\ell_{H}$$ behaves like the L1 loss. Based on our residual plot, an appropriate choice might be something like $$k = 5$$. $\ell_{H}(\epsilon) = \begin{cases} \frac{1}{2}e^{2} \qquad \qquad if \quad \vert e\vert \leq k \\\\\ k\vert e\vert - \frac{1}{2}k^{2} \quad otherwise\\ \end{cases} \tag{Eq. 5}\label{Eq. 5}$ The quantity $$d \ell(\epsilon_{i})/d \vert \epsilon_{i} \vert$$ is equal to 1 for $$\vert \epsilon \vert \leq k$$, and $$k / \vert \epsilon \vert$$ otherwise. At each timestep, $$t$$, we compute $$e_{i}^{t}$$ and $$W^{t}$$ from $$\Theta^{t-1}$$. $$\hat{\Theta}^{t}$$ is then computed using Equation 4. This processes is repeated to convergence. My Python implementation of this iterative process is shown below. def run_IRLS(steps, k): print('Finding IRLS estimates ...') for step in range(steps): for dataset in files: # Load b, x, y x = data[dataset]['x'] x = np.stack((np.ones(shape=(len(x))), x)) y = data[dataset]['y'].reshape(1, -1) b = data[dataset]['b_ols'][-1] # Calculate dataset residuals resids = np.abs(y - b.T.dot(x)).flatten() # Calculate Huber-loss, construct weight-matrix (w) loss = 0 n_ = len(resids) w = np.zeros(shape=(n_, n_)) for i in range(n_): if resids[i] > k: loss += 0.5 * resids[i] ** 2 w[i, i] = k / resids[i] else: loss += k * resids[i] - 0.5 * k ** 2 w[i, i] = 1. data[dataset]['loss_irls'].append(loss / float(n_)) # Update model parameters b = la.inv(x.dot(w.dot(x.T))).dot(x.dot(w.dot(y.T))) data[dataset]['b_irls'].append(b) # OLS residuals resids = np.array(()) for dataset in files: x = data[dataset]['x'] x = np.stack((np.ones(shape=(len(x))), x)) y = data[dataset]['y'].reshape(1, -1) b = data[dataset]['b_ols'][-1] resids = np.append(resids, y - b.T.dot(x)) return resids  The fits from this model are shown in Fig.2 at the end of the post (green regression line). Compared to the OLS fit, the IRLS fit appears more robust to the heavy tail in datasets 1_3, 1_4, and 1_5, which is consistent with the asymmetric noise that we see visually. As expected, the IRLS fits in datasets 1_1 and 1_2 or nearly identical to OLS. By definition, IRLS fits to the observed noise. Not only is it non-parametric, but the contribution of each residual to the loss function is (essentially) being set according to our semi-arbitrary choice of loss function and any associated hyperparameters (in our case $$k$$). If we were to take a second draw of n samples from each of the five distributions that generated these data, would IRLS give a similar result? Intuitively, there are many scenarios in which they could be completely different, especially when the sample size, $$n$$, is small. This is not a desirable property for predictive modeling. In practical, real world problems, we are given finite sample sizes, and the goal is to use the finite number of observations that we have to develop a model that generalizes to the distribution from which they were generated, not to the observations themselves. Maximum Likelihood estimation (MLE) We would like to represent $$\epsilon(x)$$ explicitly in our model given that the residuals seem to be correlated with $$x$$. To do this, we can reframe the problem of linear regression by considering the distribution from which data is sampled using maximum likelihood estimation, or MLE. We first define a likelihood function, $$\mathcal{L}(\Theta \vert X, Y)$$, that is equal to the probability of the observed data given a particular point estimate $$\Theta$$. The goal is to maximize this probability, or equivalently, minimize the negative of its natural logarithm. This is shown below in Eq. 6. $\begin{split} \mathcal{L}(\Theta \vert X, Y) &= p(Y, X \vert \Theta) \\\\\ &= \left(\prod_{i=0}^{n} p(y_{i} \vert x_{i}; \Theta) \right)^{\frac{1}{n}} \\\\\ &= \frac{1}{n} \sum_{i=0}^{n} ln(p(y_{i} \vert x_{i}; \Theta)) \\\\\ \hat{\Theta}_{MLE} &= \ \underset{\Theta}{\mathrm{argmin}} \frac{1}{n} \sum_{i=0}^{n} -ln(p(y_{i} \vert x_{i}; \Theta)) \end{split} \tag{Eq. 6}\label{Eq. 6}$ Distributions from the exponential family work nicely with MLE. For example, if we assume that the observed noise is normally distributed, the solution ends up reducing to ordinary least squares. If we place a Laplacian over the data, then minimizing the negative log-likelihood function is the same as minimizing a L1 loss. The benefit of MLE, though, is that the variance, $$\sigma$$, is represented explicitly. In our particular case the variance is not normally distributed, but there is a good chance that it is approximately normal conditioned on $$x$$. If the variance is constant then it ends up having no effect on the point estimates, however if we were to place some smooth function on it, say quadratic in $$x$$, now the the loss function depends on the variance. It might not be clear that this will lead to a more robust estimator, and indeed it depends on the distribution that we place on the data, but in the Gaussian and Laplacian cases it does reduce the effect of large residuals. I will first consider the Gaussian case. I start by making the variance, $$\sigma^{2}$$, quadratic in $$x$$. This choice is base purely on visual inspection of the residual plot, but seems fairly reasonable. $\begin{split} \sigma^{2}(x) &= e^{\lambda^{T}Z} \\\\\ \lambda &= \begin{bmatrix} \lambda_{0} \\ \lambda_{1} \end{bmatrix} \\\\\ Z &= \begin{bmatrix} 1 & 1 & . . . & 1 \\ x_{1}^{2} & x_{2}^{2} & . . . & x_{n}^{2} \\ \end{bmatrix} \\ \end{split} \tag{Eq. 7}\label{Eq. 7}$ Placing a Gaussian distribution on the likelihood, $$p(Y, X \vert \Theta)$$, and then taking the negative of its logarithm yields a new objective function, $$\ell_{G}$$, that has the familiar L2 norm term in addition to a new $$\sigma^{2}(x)$$ term parametrized by $$\lambda$$. This is shown in Eq. 8. $\begin{split} p_{G}(Y, X \vert \Theta) &\sim N(Y - B^{T}X, \sigma^{2}(x)) \\\\\ &= \frac{1}{\sqrt{2\pi \sigma^{2}(x)}} e^{\frac{(Y - B^{T}X)^{2}}{2\sigma^{2}(x)}} \\\\\ -ln(p_{G}(Y, X \vert \Theta)) &\propto ln(\sigma^{2}(x)) + \sigma^{-2}(x)(Y - B^{T}X)^{2} \\\\\ &= \lambda^{T}Z + e^{-\lambda^{T}Z}(Y - B^{T}X)^{2} \\\\\ &= \ell_{G}(a, b, \lambda_{0}, \lambda_{1}) \\\\\ &= \frac{1}{n} \sum_{i=1}^{n} \lambda_{0} + \lambda_{1}x_{1}^{2} + e^{-\lambda_{0} - \lambda_{1}x_{i}^{2}} (y_{i} - a - bx_{i})^{2} \end{split} \tag{Eq. 8}\label{Eq. 8}$ In these equations $$B$$ contains the slope ($$b$$) and intercept ($$a$$) terms, $$\Theta$$ is the model parameters ($$B, \lambda$$), $$Y$$ is a row vector of responses, $$y_{i}$$, and $$X$$ is a $$2 \times n$$ matrix with ones in its zeroth row and $$x_{i}$$'s in its first row. To optimize this objective function with respect to $$\lambda_{0}$$, $$\lambda_{1}$$, $$a$$, and $$b$$, I use gradient descent with a vanilla update rule: $$\Theta^{t} \leftarrow \Theta^{t-1} - \eta \ \nabla_{\Theta} \ell(\Theta)$$, where $$\eta$$ = $$9.5 \times 10^{-5}$$ (the learning rate). The resulting fits are shown in Fig. 2 (blue regression lines). The gradients used to update $$\Theta$$ at each training step are given in Eq. 9. $\begin{split} \nabla_{\lambda_{0}} \ell_{G} &= 1 - \frac{1}{n}\sum_{i=1}^{n}e^{-\lambda_{0} - \lambda_{1}x_{i}^{2}}(y_{i} - a - bx_{i}) \\\\\ \nabla_{\lambda_{1}} \ell_{G} &= \frac{1}{n} \sum_{i=1}^{n} x_{i}^{2}(1 - e^{-\lambda_{0} - \lambda_{1}x_{i}^{2}}(y_{i} - a - bx_{i})) \\\\\ \nabla_{a} \ell_{G} &= -\frac{2}{n} \sum_{i=1}^{n} e^{-\lambda_{0} - \lambda_{1}x_{i}^{2}}(y_{i} - a - bx_{i}) \\\\\ \nabla_{b} \ell_{G} &= -\frac{2}{n} \sum_{i=1}^{n} x_{i} e^{-\lambda_{0} - \lambda_{1}x_{i}^{2}}(y_{i} - a - bx_{i}) \\ \end{split} \tag{Eq. 9}\label{Eq. 9}$ def neg_log_likelihood_gaussian(b, lmda, x, y): l = 0 n_ = len(x) for i in range(n_): log_var_term = lmda[0] + lmda[1] * x[i] ** 2 var_term = np.exp(-lmda[0] - lmda[1] * x[i] ** 2) l2_term = (y[i] - b[0] - b[1] * x[i]) ** 2 l += log_var_term + var_term * l2_term return l def b_update_gaussian(b, lmda, x, y, learning_rate): b_0_grad, b_1_grad = 0, 0 n_ = len(x) for i in range(n_): var_term = np.exp(-lmda[0] - lmda[1] * x[i] ** 2) resid_term = y[i] - b[0] - b[1] * x[i] b_0_grad += var_term * resid_term b_1_grad += x[i] * var_term * resid_term b_0_grad *= -2 * n_ ** -1 b_1_grad *= -2 * n_ ** -1 b_0 = b[0] - learning_rate * b_0_grad b_1 = b[1] - learning_rate * b_1_grad return np.array([b_0, b_1]) def lmda_update_gaussian(b, lmda, x, y, learning_rate): lmda_0_grad, lmda_1_grad = 0, 0 n_ = len(x) for i in range(n_): var_term = np.exp(-lmda[0] - lmda[1] * x[i] ** 2) resid_term = (y[i] - b[0] - b[1] * x[i]) ** 2 lmda_0_grad += var_term * resid_term lmda_1_grad += x[i] ** 2 * (1 - var_term * resid_term) lmda_0_grad = 1 - n_ ** -1 * lmda_0_grad lmda_1_grad *= n_ ** -1 lmda_0 = lmda[0] - learning_rate * lmda_0_grad lmda_1 = lmda[1] - learning_rate * lmda_1_grad return np.array([lmda_0, lmda_1]) def train_gaussian(num_steps, learning_rate): print('Finding Gaussian MLE estimate ...') for step in range(num_steps): # Compute loss l_step = 0 for dataset in files: # Extract current parameters from step - 1 b, lmda = data[dataset]['b_gauss'][-1], data[dataset]['lmda_gauss'][-1] x, y = data[dataset]['x'], data[dataset]['y'] # Parameter updates data[dataset]['b_gauss'].append(b_update_gaussian(b, lmda, x, y, learning_rate)) data[dataset]['lmda_gauss'].append(lmda_update_gaussian(b, lmda, x, y, learning_rate)) # Compute loss if step % print_interval == 0: l = neg_log_likelihood_gaussian(b, lmda, x, y) l_step += l data[dataset]['loss_gauss'].append(l_step) # Save loss if step % print_interval == 0: loss_gauss.append(l_step / float(n)) print('Loss at step %s: %.4f' % (step, loss_gauss[-1]))  Given that the Gaussian MLE approach still uses the L2 term in its loss function, we might expect a similar model with a L1 loss term to be influenced even less by the heavy-tail. As I mentioned previously, this can be achieved using a Laplacian distribution. Using the same process as above, but with the loss function and gradients shown in Eqs. 10 and 11, I obtained the red regression lines shown in the figure 2. I found that a slightly higher learning rate of $$\eta=2.85 \times 10^{-4}$$ was optimal for the L1 loss, which makes sense given that we expect the rate of convergence with L1 to be lower than with L2. $\begin{split} p_{L}(Y, X \vert \Theta) &\sim L(Y - B^{T}X, \sigma(x)) \\\\\ &= \frac{1}{\sigma(x)}e^{\frac{\vert Y - B^{T}X\vert}{\sigma(x)}} \\\\\ -ln(p_{L}(Y, X \vert \Theta)) &\propto ln(\sigma(x)) + \sigma^{-1}(x)\vert Y - B^{T}X\vert \\\\\ &= \frac{1}{2}\lambda^{T}Z + e^{-\frac{1}{2}\lambda^{T}Z}\vert Y - B^{T}X\vert \\\\\ &= \ell_{L}(a, b, \lambda_{0}, \lambda_{1}) \\\\\ &= \sum_{i=1}^{n} \frac{1}{2}(\lambda_{0} + \lambda_{1}x_{1}^{2}) + e^{-\frac{1}{2}(-\lambda_{0} - \lambda_{1}x_{i}^{2})} \vert y_{i} - a - bx_{i} \vert \end{split} \tag{Eq. 10}\label{Eq. 10}$ The gradients, $$\nabla_{\Theta}\ell_{L}$$, are shown below: $\begin{split} \nabla_{\lambda_{0}} \ell_{L} &= \frac{1}{2} - \frac{1}{2n}\sum_{i=1}^{n}e^{-\frac{1}{2}(\lambda_{0} - \lambda_{1}x_{i}^{2})}\vert y_{i} - a - bx_{i}\vert \\\\\ \nabla_{\lambda_{1}} \ell_{L} &= \frac{1}{2n} \sum_{i=1}^{n} x_{i}^{2}(1 - e^{-\frac{1}{2}(\lambda_{0} - \lambda_{1}x_{i}^{2})} \vert y_{i} - a - bx_{i}\vert) \\\\\ \nabla_{a} \ell_{L} &= -\frac{1}{n} \sum_{i=1}^{n} sgn(y_{i} - a - bx_{i}) e^{-\frac{1}{2}(\lambda_{0} - \lambda_{1}x_{i}^{2})} \\\\\ \nabla_{b} \ell_{L} &= -\frac{1}{n} \sum_{i=1}^{n} sgn(y_{i} - a - bx_{i}) x_{i} e^{-\frac{1}{2}(\lambda_{0} - \lambda_{1}x_{i}^{2})}. \\ \end{split} \tag{Eq. 11}\label{Eq. 11}$ The OLS, IRLS, MLE_Guassian, and MLE_Laplacian fits for all five datasets are shown below in Fig. 2. As we can see, the MLE estimates are far more robust to the noise than the least squares approaches. As a check to see whether the MLE models overfit, I also ran 5-fold cross validation on the Laplacian model; the training and validation error are plotted in Fig. 3, and exhibit a fairly monotonic decrease over $$10^{5}$$ gradient descent steps, indicated that these models generalize well. Figure 2. Linear regression fits for each of the five datasets. Figure 3. Mean training and test error from 5-fold cross validation. Final thoughts Ordinary least squares, weighted least squares, and MLE are all common approaches to robust linear regression. In our toy example, MLE offered the most flexibility as it gave us a natural way to introduce a custom variance term into the objective function. In the future I hope to add a section on Bayesian linear regression, in which we model $$p(\Theta \vert Data)$$ via $$p(Data \vert \Theta)p(\Theta)$$ using Bayes' theorem along with optimization methods based Markov-chain Monte Carlo (MCMC) sampling. If you are interested in using any of the code that I wrote for this problem, you can find it on my github page.