Combining Forecasts (Part 2)

Authors

Jacob Steinhardt, Meena Jagadeesan

We’ll continue to explore ways of combining multiple forecasts.

10.1 Weighting Experts

Last time, we discussed the weighted mean as a candidate approach to combine point estimates. This approach requires figuring out weights for each expert. But how do we come up with these weights?

As a motivating example, let’s consider the forecasting question from a past homework:

On February 1st at 2pm EST, the U.S. Federal Reserve (the Fed) is planning to announce interest rates. Note that the federal interest rate is a range (e.g. 5% − 5.25%). What will be the upper value of this range?

When I was coming up with a forecast for this question, I consulted several sources, summarized below.

“We are changing our call for the February FOMC meeting from a 50 [basis point] hike to a 25bp hike, although we think markets should continue to place some probability on a larger-sized hike.” (source, Jan 18)

“Pricing Wednesday morning pointed to a 94.3% probability of a 0.25 percentage point hike at the central bank’s two-day meeting that concludes Feb. 1, according to CME Group data.” (source, Jan 18)

“Markets expect the Fed to raise rates again on February 1, 2023, probably by 0.25 percentage points…. However, there’s a reasonable chance the Fed opts for a larger 0.5 percentage point hike.” (source, Jan 2)

Tip

Before continuing to read, think through how you would weight these sources.

One way to come up with weights for experts is to look at their past track record. We could even improve on this by tracking what types of questions each expert is good at.

We can formalize this mathematically by conceptualizing each forecast as a noisy estimate of the truth, where different forecasts have different noise levels. Given this interpretation, what we really want is a way to convert noise levels into weights.

10.1.1 Converting Noise Levels into Weights

We’ll describe a mathematical approach to convert noise levels into weights. Although this approach is hard to literally use in practice, it has good conceptual motivation. In fact, this approach is “optimal” under some assumptions (though these assumptions may not always be true). More specifically, if estimates are unbiased and independent, and estimate $i$ has standard deviation $\sigma_i$, then we’ll show that the weights should be proportional to $1/\sigma_i^2$.

Assume there is some “true” quantity we are trying to estimate (e.g. how the forecast will actually resolve), which we denote $\mu$. We have several noisy “signals” about this quantity, which we’ll denote $X_1, \ldots, X_n$. Think of these as the estimates we get by consulting different reference classes. Model each $X_i$ as a random variable, and assume that it has mean $\mu$ (it is in expectation an accurate signal) and standard deviation $\sigma_i$ (it has some error). In other words:

$$\mathbb{E}[X_i] = \mu, \qquad \text{Var}[X_i] = \sigma_i^2.$$

Our goal is to combine these signals in the best way possible. Suppose we decide to combine them by taking some weighted sum, $\sum_i w_i X_i$. Then we need to have $\sum_i w_i = 1$ (so that the weighted sum also has expectation $\mu$). The best weights will be the ones that minimize the variance of the sum.

First, assume the $X_i$ are independent (i.e. each reference class has independent errors relative to the true quantity). We can compute the variance as

$$\text{Var}\left[\sum_{i=1}^n w_i X_i\right] = \sum_{i=1}^n w_i^2 \,\text{Var}[X_i] = \sum_{i=1}^n w_i^2 \sigma_i^2.$$

Our goal is to pick weights that minimize this variance subject to the constraint that $\sum_{i=1}^n w_i = 1$. We can show that

$$w_i \propto \frac{1}{\sigma_i^2}.$$

Interpretation: A standard deviation typically corresponds roughly to the $70\%$ confidence interval for a random variable. So if we have confidence intervals obtained from several different “signals” of the same underlying ground truth, and we think that each signal makes independent errors, the weight for each signal should be proportional to the inverse square of its confidence interval width. In practice, I think this is a bit too aggressive, mainly because most errors are not independent in practice. So I tend to take weights that are closer to $\frac{1}{\sigma_i}$, i.e. the inverse (not inverse square) of the confidence interval width. But I’m not confident this is correct; under some assumptions, dependent errors imply that we should place even more weight on smaller values of $\sigma_i$.
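To make this concrete, here is a minimal sketch in Python (the estimates and standard deviations are made up for illustration):

```python
import numpy as np

# Hypothetical point estimates from three sources, with subjective
# standard deviations reflecting how noisy we think each source is.
estimates = np.array([5.00, 4.75, 5.25])
sigmas = np.array([0.10, 0.25, 0.50])

# Inverse-variance weights: optimal for unbiased, independent estimates.
w = 1 / sigmas**2
w /= w.sum()  # normalize so the weights sum to 1
print(w, w @ estimates)

# Softer alternative from the discussion above: inverse-standard-deviation
# weights, which hedge against the independence assumption failing.
w_soft = 1 / sigmas
w_soft /= w_soft.sum()
print(w_soft, w_soft @ estimates)
```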

Generalization to dependent errors: If the errors are dependent, we should intuitively downweight estimates that are more correlated with others. We can formalize this intuition using a bit of linear algebra, although the resulting formula isn’t necessarily that illuminating. Suppose that we have the same setup as before with $\mathbb{E}[X_i] = \mu$. Since the errors are now dependent, we need to consider the covariance between every pair $X_i$, $X_j$, so let $S_{ij} = \text{Cov}(X_i, X_j)$, in which case $S_{ii} = \sigma_i^2$ is the variance of $X_i$. Then the formula for the variance of $\sum_i w_i X_i$ is

$$w^{\top} S w = \sum_{i,j=1}^n w_i w_j S_{ij}.$$

Taking derivatives and setting them to zero, we find that we should choose weights $w$ proportional to $S^{-1}\mathbf{1}$, which is the inverse covariance matrix multiplied by the all-ones vector. Somewhat oddly, this can lead to negative weights in some cases.
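As a sketch of this computation (the covariance matrix below is made up), the weights can be obtained by solving a linear system:

```python
import numpy as np

# Hypothetical error covariance matrix: signals 1 and 2 are strongly
# correlated with each other; signal 3 is noisier but independent.
S = np.array([
    [0.04, 0.03, 0.00],
    [0.03, 0.04, 0.00],
    [0.00, 0.00, 0.09],
])

w = np.linalg.solve(S, np.ones(len(S)))  # w proportional to S^{-1} 1
w /= w.sum()                             # normalize to sum to 1
print(w)  # the two correlated signals split their weight between them
```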

10.2 Working in Teams: The Delphi Method

Beyond assigning weights to forecasts, another way to take advantage of multiple perspectives is to allow the forecasters to work together and share their reasoning with each other.

In the Delphi method, forecasters first come up with predictions and reasoning individually, and then share both with the group. Each individual updates based on the group’s forecasts, potentially over multiple rounds. At the end, the group creates a forecast by averaging all of the final individual forecasts.

Intuitively, this approach benefits not only from averaging but also from discussion: you can improve your estimates of intermediate quantities, or improve which reference classes you consider, by discussing with others.

There are several variants of the Delphi method. For example, predictions and reasoning can be provided anonymously. Or only reasoning is shared with the group, not numerical predictions.
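As a toy illustration (not a faithful model of real deliberation), here is a minimal simulation in which each forecaster partially updates toward the group mean over several rounds; the number of rounds and the 0.3 step size are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
forecasts = rng.normal(5.0, 0.5, size=7)  # hypothetical initial forecasts

for _ in range(3):  # three Delphi rounds
    group_mean = forecasts.mean()
    # Each forecaster moves partway toward the group consensus.
    forecasts += 0.3 * (group_mean - forecasts)

print(forecasts.mean())  # final group forecast: average of the individuals
```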

10.3 Combining Confidence Intervals

So far, we’ve focused on combining point estimates. But what if we want to combine confidence intervals? That is, if each forecaster gives us a confidence interval, how should we come up with a confidence interval that combines all of their estimates?

The simplest approach is to take the mean (or the trimmed mean) of the upper and lower endpoints. However, this can be an underestimate (e.g., if the intervals are disjoint) or can end up dominated by a few large intervals (e.g., if some intervals span different orders of magnitude than the others).
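For concreteness, here is a quick sketch of endpoint averaging (the intervals are hypothetical; the last one is deliberately much wider):

```python
import numpy as np
from scipy import stats

lowers = np.array([4.8, 4.9, 5.0, 2.0])
uppers = np.array([5.2, 5.3, 5.4, 9.0])

# Plain mean of endpoints: the wide interval drags the result around.
print(lowers.mean(), uppers.mean())

# Trimmed mean drops the most extreme 25% on each side before averaging.
print(stats.trim_mean(lowers, 0.25), stats.trim_mean(uppers, 0.25))
```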

Here is a more sophisticated approach to combining intervals that uses mixtures. Suppose we have confidence intervals $[a_1, b_1], \ldots, [a_n, b_n]$, and we want to combine these together. Conceptually, we can think of each confidence interval $[a_i, b_i]$ as a probability distribution $p_i$ such that $a_i$ and $b_i$ are lower and upper percentiles of $p_i$. We then want to make a new forecast based on averaging these distributions, to obtain $$\bar{p} = \frac{p_1 + \cdots + p_n}{n},$$ where the sum denotes the mixture of distributions. (As an example, if all of the distributions $p_i$ are equal to some distribution $p$, then the average distribution $\bar{p}$ is also equal to $p$.)

Warning

Sampling from $\bar{p}$ is not the same as averaging one sample from each of the $n$ distributions. Instead, a correct way to get a sample from $\bar{p}$ is to sample $i$ uniformly from $\{1, 2, \ldots, n\}$ and then take a sample from $p_i$.
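A minimal sketch of the two procedures, assuming (purely for illustration) that each $p_i$ is a normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
mus = np.array([4.9, 5.1, 5.6])     # hypothetical component means
sigmas = np.array([0.2, 0.3, 0.5])  # hypothetical component std devs

# Correct: pick a component uniformly at random, then sample from it.
i = rng.integers(len(mus), size=10_000)
mixture_samples = rng.normal(mus[i], sigmas[i])

# Incorrect: averaging one sample from each component shrinks the spread.
averaged_samples = rng.normal(mus, sigmas, size=(10_000, 3)).mean(axis=1)

print(mixture_samples.std(), averaged_samples.std())  # mixture is much wider
```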

We would like the upper and lower confidence bounds of this new distribution $\bar{p}$. How can we get them?

10.3.1 Basic Formulation of Mixture-Based Confidence Interval

First, let’s use the heuristic that being within one standard deviation corresponds to the 70% confidence interval of $p_i$. So $p_i$ is a distribution with variance $\sigma_i^2$ roughly proportional to $(b_i - a_i)^2$. What, then, is the variance of $\bar{p}$?

It turns out the variance $\text{Var}(\bar{p})$ depends on the means of each $p_i$ as well, so suppose that each $p_i$ has mean $\mu_i$. (One way to approximate $\mu_i$ is as $(a_i + b_i)/2$, although we might want to use a different value if we think the distribution is left- or right-skewed.) Observe that $\bar{p}$ has mean $$\bar{\mu} = \frac{\mu_1 + \cdots + \mu_n}{n}.$$ The variance is a bit trickier to compute, but can be calculated as the average of the variances plus the variance of the means: $$\text{Var}(\bar{p}) = \frac{\sigma_1^2 + \cdots + \sigma_n^2}{n} + \frac{(\mu_1 - \bar{\mu})^2 + \cdots + (\mu_n - \bar{\mu})^2}{n}.$$

Turning this back into confidence intervals, we want a confidence interval centered at $\bar{\mu}$ with half-width $\bar{\sigma}$. That is, the combined confidence interval is

$$[\bar{\mu} - \bar{\sigma}, \bar{\mu} + \bar{\sigma}],$$

where

$$\bar{\sigma} = \sqrt{\frac{\sigma_1^2 + \cdots + \sigma_n^2}{n} + \frac{(\mu_1 - \bar{\mu})^2 + \cdots + (\mu_n - \bar{\mu})^2}{n}}.$$

To unpack this formula, let’s consider several examples. The simplest case is where all of the confidence intervals are equal to $[a, b]$: in this case, the combined confidence interval is also equal to $[a, b]$. As another example, if all of the confidence intervals have the same mean but different widths, then the second term is equal to $0$: this means the combined confidence interval width is equal to the quadratic mean of the widths of the original confidence intervals. If the means are different, then the second term serves as a penalty that increases the combined confidence interval width.
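A sketch of the whole computation, using the heuristic that each interval’s half-width approximates one standard deviation (the intervals themselves are hypothetical):

```python
import numpy as np

a = np.array([4.8, 4.9, 5.2])  # lower endpoints
b = np.array([5.2, 5.5, 6.0])  # upper endpoints

mu = (a + b) / 2     # approximate each component mean by the midpoint
sigma = (b - a) / 2  # heuristic: half-width of a ~70% interval is ~1 std dev

mu_bar = mu.mean()
# Mixture variance = average of the variances + variance of the means.
var_bar = (sigma**2).mean() + ((mu - mu_bar)**2).mean()
sigma_bar = np.sqrt(var_bar)

print(mu_bar - sigma_bar, mu_bar + sigma_bar)  # combined ~70% interval
```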

10.3.2 Adjustments to Mixture-Based Confidence Interval

The confidence interval above doesn’t account for two additional things we care about:

  • The width should depend on the actual interval probability we are using, e.g. 70% vs 80% vs 90%. The derivation above is mostly about the 70% interval due to the “70% = standard deviation” heuristic.
  • We want to account for skewness—if the original intervals are left- or right-skewed, we want our final one to be as well.

Here is a heuristic formula that has both of these properties, as well as approximately the correct width. Let $\mu_0$ be the overall mean estimate (or the median of the interval centers, or any other good guess of the center of the distribution). Choose the following confidence interval:

$$\left[\mu_0 - \sqrt{\frac{\sum_{i=1}^n (\mu_0 - a_i)^2}{n}},\; \mu_0 + \sqrt{\frac{\sum_{i=1}^n (b_i - \mu_0)^2}{n}}\right]$$

Why is this a reasonable estimate? This confidence interval is centered at $\mu_0$ (the mean) and has lower width $\sqrt{\frac{\sum_{i=1}^n (\mu_0 - a_i)^2}{n}}$ and upper width $\sqrt{\frac{\sum_{i=1}^n (b_i - \mu_0)^2}{n}}$. If all of the confidence intervals are equal to $[a, b]$ and $\mu_0$ is the mean of the interval centers, then the proposed confidence interval is also $[a, b]$. More generally, if $\mu_0$ is the mean of all of the interval centers, then we can interpret $$\frac{1}{2} \left(\frac{\sum_{i=1}^n (\mu_0 - a_i)^2}{n} + \frac{\sum_{i=1}^n (b_i - \mu_0)^2}{n}\right)$$ as the variance of the distribution given by sampling the interval endpoints $a_i$ and $b_i$ uniformly at random. The proposed confidence interval essentially uses this as the width, but treats the lower and upper bounds separately.
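A sketch of this adjusted formula, reusing the hypothetical intervals from the previous sketch and taking $\mu_0$ to be the mean of the interval centers:

```python
import numpy as np

a = np.array([4.8, 4.9, 5.2])  # lower endpoints
b = np.array([5.2, 5.5, 6.0])  # upper endpoints

mu0 = ((a + b) / 2).mean()  # center: mean of the interval midpoints

# Lower and upper widths are computed separately, which preserves skew.
lower_width = np.sqrt(((mu0 - a) ** 2).mean())
upper_width = np.sqrt(((b - mu0) ** 2).mean())

print(mu0 - lower_width, mu0 + upper_width)  # combined interval
```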

10.4 Summary

Let’s end with a summary of what we’ve learned about combining forecasts over the past two lectures.

  • Averaging multiple approaches or experts often improves forecasts.

  • Assess track record and accuracy of sources to determine weights.

  • Consider working in teams and generating independent numbers.

  • Combining confidence intervals: several ideas, but there is no silver bullet (yet).


  1. Shared by an economist at Citigroup, the 3rd largest banking institution in the US.

  2. Based on CME Group data. The CME Group is the world’s largest financial derivatives exchange. The CME FedWatch Tool uses futures pricing data (the 30-Day Fed Funds futures pricing data) to analyze the probabilities of changes to the Fed rate.

  3. From Simon Moore, a writer at Forbes. He provides an outsourced Chief Investment Officer service to institutional investors. He has previously served as Chief Investment Officer at Moola and FutureAdvisor, both consumer investment startups that were subsequently acquired by S&P 500 firms. He has published two books, is a CFA Charterholder, and was educated at Oxford and Northwestern.

  4. Recall that our goal is to minimize $\sum_{i=1}^n w_i^2 \sigma_i^2$ subject to the constraint $\sum_{i=1}^n w_i = 1$. To do this, we use Lagrange multipliers. The Lagrangian is $$L(w_1, \ldots, w_n, \lambda) = \sum_{i=1}^n w_i^2 \sigma_i^2 - \lambda \left(\sum_{i=1}^n w_i - 1\right).$$ The partial derivative of $L$ with respect to $w_i$ is $$\frac{\partial L}{\partial w_i} = 2 \sigma_i^2 w_i - \lambda.$$ Setting this equal to $0$, we obtain $$w_i = \frac{\lambda}{2 \sigma_i^2},$$ which proves that the weights are proportional to $1/\sigma_i^2$.

  5. The variance $\text{Var}(\bar{p})$ is equal to $\frac{1}{n} \sum_{i=1}^n \mathbb{E}_{X \sim p_i}[X^2] - \bar{\mu}^2$, which is equal to $\frac{1}{n} \sum_{i=1}^n (\sigma_i^2 + \mu_i^2) - \bar{\mu}^2$, which is equal to $\frac{\sigma_1^2 + \cdots + \sigma_n^2}{n} + \frac{1}{n} \sum_{i=1}^n \mu_i^2 - \bar{\mu}^2$. The second term is equal to the variance of the distribution given by choosing one of $\mu_1, \ldots, \mu_n$ uniformly at random, and can equivalently be written as $\frac{1}{n} \sum_{i=1}^n (\mu_i - \bar{\mu})^2$.