Application: Improving Internal Forecasting at Open Philanthropy
A hypothetical case study in applying forecasting
This is an Application post, one of many post categories found on this blog. To see what to expect from posts in each category, along with categorized lists of past posts, check out the Table of Posts. As always, feedback or collaboration of any kind is greatly appreciated either here in the comments or privately on Twitter (@Damien_Laird).
Summary
I believe with high confidence [>90%] that training, tighter accuracy feedback loops, allowing multiple people to forecast each question, and aggregating the forecasts of multiple people would all greatly improve the accuracy of Open Philanthropy’s internal forecasting, and in turn the capabilities of its employees. I believe with moderate confidence [>50%] that these interventions are feasible for them to implement, with the list above ordered from most feasible to least feasible. An employee at Open Philanthropy should be able to evaluate my proposals and arrive at much higher confidence levels on feasibility, further detail them to fit into the existing organization, and estimate their cost of implementation.
The evidence backing these assertions is addressed in the following section.
Introduction
Open Philanthropy, a grantmaking organization central to the Effective Altruism community, seems interested in improving their internal forecasting. They incorporate forecasting into their grantmaking process and have shared a lot of information about it publicly. Relatedly, they also fund forecasting research and related organizations, so improving their own forecasting savvy would likely have ripple effects throughout the network.
I base my understanding of their forecasting system and results to date on their public write-up here. All of my understanding of the grantmaking process it's largely embedded within comes from their publicly documented process, write-up template, and classification scheme. Overall, I expect these have given me a pretty good view of the processes that their current forecasting practices are embedded in, while I'm still largely in the dark on the culture and resource constraints surrounding their forecasting efforts.
Of course, any actual improvements would have to be made by someone internal to the organization. They would have the benefit of the full context of these processes that can't be captured in writing, an understanding of the organizational resources available, and the ability to assess the existing internal culture and perceptions around forecasting. These are absolutely prerequisites to making any meaningful, positive changes, but after reviewing the above documents I feel pretty confident laying out a menu of options that such a person could draw from to improve their internal forecasting [>90% confident each intervention would improve their internal forecasting, >50% that each would be feasible for them to implement]. There seems to be enough of both room for improvement and freedom of design that I expect significant progress to be possible.
To very briefly summarize the current state for readers: when someone at Open Philanthropy authors a write-up to accompany a grant (a typical practice), there is an optional section for them to include forecasts with specific percentage estimates and resolution dates. As these write-ups are reviewed at later dates, the forecasts are evaluated and logged in a database where they can be aggregated across the organization or by individual forecaster to evaluate accuracy.
The following sections describe possible interventions to improve on this current state, ordered very roughly from what I expect to be least costly to most costly, though obviously a much better estimate of both costs and benefits would be possible internally. The research referenced throughout can be found here, discussed in more detail and with citations. My reading of that research, plus my own intuitions from participating in this tournament, together give me the >90% confidence listed above in the efficacy of the following interventions.
Training
I didn't find any mention of forecasting training in the documents I reviewed. The Good Judgment Project (GJP) found significant improvements (~10%) in forecasting accuracy, sustained for about a year, following only an hour of training! The training they used was based mainly on a best guess of what was likely to be useful, later refined with input from their top forecasters. I'm sure an even more effective training program could be constructed by either the top internal forecasters (the organization already has the data to identify them) or proven external forecasters that Open Philanthropy likely already has some sort of relationship with. For the lowest-hanging fruit, they could simply reuse the GJP's own training, which was published alongside their research.
More speculatively, I'd expect the de-biasing that forecasting training has targeted thus far to have more general positive impacts in an organization so focused on trying to make rational bets.
Feedback
Reading through Open Philanthropy's processes, evaluation of the forecasts seems like a bit of an afterthought, one that sometimes doesn't happen at all. This looks like an oversight that forfeits some of forecasting's indirect benefits. By the time a forecast resolves, the grantmaking decision has likely already been made, but the feedback loop between a forecaster and their results is what enables forecasters to improve over time. This improvement has been consistently demonstrated in tight-feedback environments like the GJP. I would recommend breaking forecast resolution out of the current grantmaking pipeline to enable this: ideally, require all forecasts to have a clear resolution date (this is already recommended) and then have those dates act as triggers in a separate process that evaluates accuracy.
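As a rough sketch of what that separate process could look like, with hypothetical field names and the standard Brier score as the accuracy measure (none of this is drawn from Open Philanthropy's actual tooling):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Forecast:
    question: str            # e.g. "Will the grantee reach milestone X by the resolution date?"
    forecaster: str
    probability: float       # stated probability that the event occurs, between 0 and 1
    resolution_date: date
    outcome: bool | None = None   # filled in once the question resolves

def brier_score(probability: float, occurred: bool) -> float:
    """Squared error between the stated probability and the 0/1 outcome; lower is better."""
    return (probability - float(occurred)) ** 2

def due_for_resolution(forecasts: list[Forecast], today: date) -> list[Forecast]:
    """Forecasts whose resolution date has passed but whose outcome hasn't been recorded."""
    return [f for f in forecasts if f.resolution_date <= today and f.outcome is None]
```

A scheduled job could surface the output of due_for_resolution to whoever resolves questions, and the resulting Brier scores could be logged per forecaster so each person sees their own track record soon after resolution.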
The somewhat scarce data on Open Philanthropy's internal forecasting shows an interesting pattern: forecasters are actually overconfident in their most extreme forecasts, which is the opposite of what tends to be found in the research! This leads me to believe that both training and a tighter feedback loop could yield significant improvements in accuracy by reducing this bias. The GJP found that participating in a forecasting tournament decreased participant confidence in domestic US policy positions, despite those being unrelated to the topics being forecasted. Their hypothesis was that the (somewhat brutal) environment of tight, continuous feedback on forecasts cultivated more general epistemic humility. Again, this seems likely to be a broadly desirable trait in this organization!
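Detecting that overconfidence pattern from resolved forecasts is straightforward. A minimal sketch, assuming forecasts are stored as (stated probability, outcome) pairs:

```python
def calibration_report(resolved, n_buckets=10):
    """resolved: list of (stated_probability, occurred) pairs for resolved forecasts.

    Groups them into equal-width probability buckets and compares the average stated
    probability to the observed frequency. Stated probability well above observed
    frequency in the top bucket is the overconfidence-at-the-extremes pattern above."""
    buckets = [[] for _ in range(n_buckets)]
    for probability, occurred in resolved:
        index = min(int(probability * n_buckets), n_buckets - 1)
        buckets[index].append((probability, occurred))
    report = []
    for index, pairs in enumerate(buckets):
        if not pairs:
            continue
        stated = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(1 for _, o in pairs if o) / len(pairs)
        report.append((index / n_buckets, (index + 1) / n_buckets, stated, observed, len(pairs)))
    return report
```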
Consensus
The absolute largest gains in forecasting accuracy (~35% in the GJP, with a fancy aggregation algorithm) come from aggregating the forecasts of many individuals. I expect getting multiple individuals (the more the better) to forecast each question to be the single largest intervention available to improve accuracy; it's only this far down the list because I recognize it represents a significant deviation from the current processes and would require more time from more individuals. Designing this new forecasting process well also requires much more detailed knowledge than I have access to.
A hypothetical process could involve breaking the predictions out from the rest of the write-up early in the process (maybe ~2 weeks before the write-up is needed?) so they could be shared with the entire organization, along with whatever supporting information is required to make sense of them. Anyone could submit predictions for each question while it was open, as well as use a dedicated collaboration space to share related resources or debate assumptions and methods. This would likely require some incentive, but a leaderboard based on accuracy might be sufficient. That being said, cash bonuses could be used to incentivize not only individual accuracy but also improving organization-wide accuracy through collaboration. This introduces the obvious possible confounder of participants crafting their own forecasting questions, but maybe the existing incentives for convincing grant write-ups already prevent this?
The consensus process could then end when the grant write-up is needed, with the crowd's consensus forecast being used instead of the prediction from just one individual. This would also help keep the number of forecasting questions open at any one time to a minimum.
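The simplest version of that consensus forecast is just the median of everyone's stated probability, which is already robust to a single wild outlier (the numbers below are invented for illustration):

```python
from statistics import median

def consensus_forecast(individual_probabilities: list[float]) -> float:
    """Baseline consensus: the median of every participant's stated probability."""
    return median(individual_probabilities)

# Five hypothetical staff forecasts on one question:
print(consensus_forecast([0.55, 0.60, 0.70, 0.40, 0.65]))  # -> 0.6
```

The aggregation intervention below is about replacing this simple median with something smarter.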
It’s impossible for me to know what level of participation you could expect in such an environment, and I’m sure it would depend on the topic being forecasted. That being said, even a few participants per question should significantly increase accuracy and would have the indirect benefit of breaking down silos between focus areas and worldviews by exposing all of the organization to all of the relevant predictions.
An alternative approach to achieving consensus on forecasts could be to make forecasts public on a platform like Metaculus or Manifold, but I expect this to be seriously complicated by the confidential nature of granter/grantee interactions.
Aggregation
This intervention is last because it is only possible if the (admittedly costly) consensus-enabling intervention described above is already implemented. If it were, though, you could analyze past data to select an algorithm that generates a much more accurate consensus forecast than a simple median. You could immediately do things like extremizing the consensus and weighting forecasts by each forecaster's past accuracy, and by collecting slightly more data you could also factor in the forecaster's confidence in their prediction and the topic being forecast. I expect the actual aggregation side of this (vs. the R&D side of developing the algorithm) to be easily automated with minimal overhead cost.
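To make that concrete, here is a minimal sketch of one such algorithm: a weighted average in log-odds space followed by extremization. The weighting scheme, the clamping, and the extremization exponent are illustrative assumptions, not anything from Open Philanthropy or the GJP; in practice those parameters would be fit to past data rather than chosen by hand.

```python
import math

def weighted_extremized_consensus(forecasts, extremization=2.5):
    """forecasts: list of (probability, weight) pairs, where the weight could be derived
    from each forecaster's past Brier scores (better track record -> larger weight).

    Averages the forecasts in log-odds space, then extremizes the result by scaling the
    log-odds, which pushes the consensus away from 0.5."""
    total_weight = sum(weight for _, weight in forecasts)
    weighted_log_odds = 0.0
    for probability, weight in forecasts:
        # clamp so a forecast of exactly 0 or 1 doesn't produce infinite log-odds
        p = min(max(probability, 0.01), 0.99)
        weighted_log_odds += weight * math.log(p / (1 - p))
    mean_log_odds = weighted_log_odds / total_weight
    return 1 / (1 + math.exp(-extremization * mean_log_odds))

# Three hypothetical forecasts, weighted by (made-up) past accuracy:
print(weighted_extremized_consensus([(0.60, 1.0), (0.70, 2.0), (0.55, 0.5)]))
```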
Conclusion
Not only would I expect these interventions to improve internal forecasting at Open Philanthropy, they could also advance the state of the art in forecasting methods. Large, high-quality datasets seem scarce in the forecasting research world, and Open Philanthropy could potentially produce a large volume of forecasts from dedicated, capable individuals on exactly the topics they're interested in. If the interventions were performed and documented well, I'd personally forecast a significant accelerating impact on a still-fledgling field.