Theory: Inputs to Forecast Quality
How I categorize interventions that could make forecasts better
This is a Theory post, one of many post categories found on this blog. To see what to expect from posts in each category, along with categorized lists of past posts, check out the Table of Posts. As always, feedback or collaboration of any kind is greatly appreciated either here in the comments or privately on Twitter (@Damien_Laird).
This post is going to pull from both the research I’ve reviewed on prediction polling, and my own experiences and intuitions from participating in a forecasting tournament that I’ll be posting more about soon.
The categories in this flow chart have fuzzy boundaries, but I feel confident that I’ve drawn them in useful places. It’s intended to highlight the overarching areas in the process of generating a forecast where we have an opportunity to improve its quality. This might mean its accuracy, but it could also be the comprehensibility of its rationale, or something else I haven’t thought of yet.
Forecaster
The person putting the forecast together. Research has already found factors like fluid intelligence and open-mindedness, which likely can't be easily changed with interventions, to be significant predictors of forecaster accuracy in geopolitical forecasting tournaments.
This leads me to believe that fully “optimizing” forecast generation would include recruiting particular forecasters. However, past forecasting performance has been found to be an even stronger predictor of accuracy than the kinds of attributes I listed above, though it’s not clear yet how well this transfers between domains or kinds of forecasts. Having a pipeline of forecasters competing against each other in the relevant domain so that you can recruit the top performers for the most important projects seems ideal, and in fact, this is where the concept of “superforecasters” came from.
Development
The boundary between development and training is particularly blurry, but I'm using development to refer to long-term interventions that aren't feasible to implement at the start of a project.
Does a particular college degree or professional background make someone more adept at forecasting in a particular domain? Can we create forecasting curricula that go far deeper than one-hour trainings and elicit more profound performance gains?
Do aspects of childhood development influence the factors that we see predict forecasting ability? I know this sounds unlikely to be applicable for current research interventions, but I’m trying to cast the widest net I can with this post so that I can later drill down into specific opportunities that I think are feasible.
Training
These are the interventions that are possible right at the beginning of a project. With a modest budget of time and energy, what kind of performance gains can we achieve?
A mostly blind guess at what might constitute useful training for the Good Judgment Project (GJP) induced a roughly 10% accuracy boost and seemed to accelerate performance gains from continual practice. Can we create a feedback loop between top forecasters and training programs to focus material on what's most useful, and then empirically confirm that it works? Can we offer training in particular domains so that generalist forecasters stand a better chance of finding useful information when they research a particular question?
Collaboration
Similar to training, the GJP achieved significant accuracy gains from allowing forecasters to collaborate on teams. Though there were some efforts to mitigate groupthink and diffusion of responsibility, like having each team member create independent forecasts, these efforts were a first step and largely unoptimized.
Can the use of higher bandwidth communication environments (vs. comment boxes after forecast rationales) improve deliberations? What about the highest bandwidth medium of all, in-person collaboration?
Can we structure teams differently, so that not every member is an independent forecaster? Maybe dedicated researchers who facilitate forecasters' inquiries, or domain experts who act as sounding boards, could improve performance over the relatively flat team structures that I've seen so far.
Investigation
Independent of their forecasting skill, a forecaster has to actually find relevant information in order to incorporate it into their reasoning. This may have been a minor concern in the GJP, where the most relevant information for near-term geopolitical questions was news articles that (I strongly suspect) mostly reported the same information from a single source and could be surfaced by keyword searches. My intuition is that it's a much more significant bottleneck when attempting to forecast Global Catastrophic Risks (GCRs). My anecdotal experience has been that these topics are so niche that lots of highly relevant information is buried in very specialized sources that are challenging for a layman to understand, or even find in the first place, especially under the time and effort constraints of most forecasting tournaments. How can we help forecasters find the information they need for higher quality forecasts?
Information
The information that forecasters need to find may not even be documented anywhere. We’re still discovering what attributes and types of information set accurate forecasts apart from the rest, so it would be surprising if the existing incentives for knowledge creation happened to align with what forecasters most need.
An obvious example in the context of GCRs is historical base rates. Can we incentivize the creation of high quality accountings of past pandemics or volcanic eruptions or nuclear close calls? They would have to be suitable for a layman to not just comprehend, but understand deep enough to be able to decide how to vary those historical base rates based on differences in context between then and now. My sense is that these sorts of inventories of past events often do exist, but not at all in this form or created from this perspective.
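To make that base-rate reasoning concrete, here's a toy sketch in Python of how a forecaster might go from a historical inventory to an adjusted probability. Every number in it is invented for illustration, and the independence assumption at the end is a strong simplification.

```python
# Toy sketch only: every number below is invented for illustration.
# A real historical inventory would supply the counts, and the forecaster's
# own judgment would supply the context adjustment.

historical_events = 4        # e.g., qualifying close calls in the record
years_observed = 70          # span of the historical record

base_rate_per_year = historical_events / years_observed  # ~0.057

# Judge that current circumstances differ from the historical record
# (better safeguards, say) and scale the base rate accordingly.
context_adjustment = 0.5     # subjective multiplier; < 1 means "less likely now"
adjusted_annual_p = base_rate_per_year * context_adjustment

# Probability of at least one event over a 10-year question window,
# assuming independence across years (a strong simplification).
horizon_years = 10
p_at_least_one = 1 - (1 - adjusted_annual_p) ** horizon_years

print(f"Adjusted annual probability: {adjusted_annual_p:.3f}")
print(f"P(at least one event in {horizon_years} years): {p_at_least_one:.2f}")
```

The point of an inventory built from this perspective is that it would hand the forecaster both the counts and enough context to defend (or reject) a multiplier like the one above.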
Forecast
The form a forecast takes can have a significant influence on not just its accuracy but also its relevance for different kinds of decision making. Is it just a percent likelihood or is there an associated rationale? Are there confidence intervals, is it a probability distribution, or is it a bet on a prediction market? Which of these decisions should be left to the individual forecaster vs. enforced by a platform/project to ensure comparability between forecasts? Are there “lighter touch” interventions that are useful, like providing example templates to choose from?
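To make those format options concrete, here's a minimal sketch of what they might look like as data structures. The classes and field names are my own illustration, not the schema of any actual forecasting platform.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative only: these classes and field names are invented,
# not the schema of any actual forecasting platform.

@dataclass
class PointForecast:
    question: str
    probability: float               # a bare percent likelihood, e.g. 0.12
    rationale: Optional[str] = None  # optional written justification

@dataclass
class IntervalForecast:
    question: str
    low: float                       # lower bound of the interval
    high: float                      # upper bound
    confidence: float = 0.9          # e.g. a 90% confidence interval

@dataclass
class DistributionForecast:
    question: str
    bin_edges: List[float] = field(default_factory=list)
    bin_probabilities: List[float] = field(default_factory=list)

# Example usage:
f = PointForecast(
    question="Will X occur before 2030?",
    probability=0.12,
    rationale="Historical base rate ~10%, adjusted slightly upward.",
)
```

Which of these a platform enforces, and which it leaves to the forecaster, is exactly the comparability trade-off described above.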
Aggregation
Prediction markets are already a kind of aggregation, but there are multiple ways to benefit from the "wisdom of crowds" phenomenon. The aggregation algorithms used in the GJP were the most significant factor in improving the accuracy of its final forecasts over those of its competitors, far beyond what would be possible with simple averaging. They weighted forecasts based on things like past forecaster performance, and ongoing research is still experimenting with which methods of aggregation are most impactful. I'm sure forecast type and format greatly affect what methods of aggregation are even feasible, so this domain is ripe for continued improvement as the rest of the field evolves.
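As a rough sketch of the general idea (performance-weighted averaging followed by extremizing), the code below shows what this kind of aggregation might look like. It is not the GJP's actual published algorithm; the weights and the extremizing exponent are stand-ins for parameters that would have to be fit empirically.

```python
import numpy as np

def aggregate(probabilities, weights, extremize_a=2.0):
    """Performance-weighted average of probability forecasts, then extremized.

    Generic illustration, not the GJP's published algorithm: `weights` stands
    in for some measure of past accuracy, and the extremizing exponent
    `extremize_a` is a tunable parameter.
    """
    p = np.asarray(probabilities, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                   # normalize the weights

    pooled = float(np.dot(w, p))      # weighted average of the forecasts

    # Extremize: push the pooled probability away from 0.5, on the idea that
    # the crowd collectively knows more than any one under-confident member.
    odds = (pooled / (1.0 - pooled)) ** extremize_a
    return odds / (1.0 + odds)

# Example: three forecasters, the first with the strongest track record.
print(aggregate([0.70, 0.60, 0.55], weights=[3.0, 1.0, 1.0]))  # ~0.78
```

Even this toy version shows why format matters: weighting and extremizing are straightforward for point probabilities, but aggregating rationales, intervals, or full distributions requires different machinery entirely.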
"Can we incentivize the creation of high quality accountings of past pandemics or volcanic eruptions or nuclear close calls? They would have to be suitable for a layman to not just comprehend, but understand deep enough to be able to decide how to vary those historical base rates based on differences in context between then and now."
I don't know if this fits what you had in mind here, Damien, but the Union of Concerned Scientists is one organization with both cause-based and financial incentives to create and maintain accounts of nuclear weapons "close calls." Here's their 2015 document summarizing – and even characterizing by context – a fair number of those:
https://www.ucsusa.org/sites/default/files/attach/2015/04/Close%20Calls%20with%20Nuclear%20Weapons.pdf
As well, the Global Volcanism Program of the Smithsonian Institution's National Museum of Natural History maintains "a catalog of Holocene and Pleistocene volcanoes, and eruptions from the past 12,000 years":
https://volcano.si.edu/
(I haven't tried to search the database, and thus don't yet know how easy it may be to filter on criteria of particular relevance to forecasting work.)
One of their FAQ entries addresses the question, "What volcanoes have the most people living nearby?" While, AIUI, truly catastrophic impacts from such eruptions might involve sudden drops in global temperatures, induced droughts, and the like, the presence of large nearby populations might sometimes also be relevant?
https://volcano.si.edu/faq/index.cfm?question=population