Post-Mortem: 2022 Hybrid Forecasting-Persuasion Tournament
How this novel tournament worked and my thoughts on participating
Introduction
This post is looking back at a forecasting tournament that I participated in between June and November 2022. I’m writing this now in February 2023 and I wish I had done so closer to the tournament’s conclusion. I think I did a pretty good job documenting my work throughout the event, so hopefully my recollections are pretty accurate. Now that I have a place to publish things like this, I’ll be much more timely moving forward. As part of the tournament rules, we were not able to publicly share information about it until 4 months after it ended, so I’m posting this about a month after writing it.
My understanding is that post-mortems are a common practice in the forecasting world, but admittedly, I haven’t read many of them. For now, I’m claiming that’s intentional so that I can present this one in a way that I think is most valuable vs. mimicking what I’ve seen other people do, but maybe in the future I’ll do a meta post about post-mortems in general.
With this post, I’m endeavoring to…
Share what the tournament was like. Both mechanically and as an experience. I’m hoping this helps other people decide that they might enjoy participating in them, but at a minimum it should help people to better understand what they are.
Review my own performance. This should help me improve, but if other forecasters read this hopefully they improve as well. I care much more about improving the state of the art than having a competitive edge in these things.
Review the tournament format. This is actually the most critical section from my perspective, as my intuitions from participating in this tournament have informed a lot of my views on where forecasting global catastrophic risks (GCRs) can be most improved.
Discovery
I heard about this tournament on Astral Codex Ten. I believe Scott posted in an open thread that the organizers were looking for volunteers, and that they’d be compensated for their time. Being financially incentivized to learn more about GCRs and forecasting sounded like a good deal to me, so I applied. The tournament was described as being conducted by a team led by Phil Tetlock, who originated this kind of tournament with the Good Judgement Project, and funded by Open Philanthropy, an Effective Altruism (EA) funding organization that is interested in mitigating global catastrophic risks among other cause areas.
The application process is what I remember least well about the whole tournament. Partially because it was so long ago, but also because I completed the entire application on my phone before getting out of bed one morning. This involved answering lots of questions about myself and my knowledge base around forecasting GCRs. It also involved actually forecasting something, I think it might have been COVID case trends, and explaining my rationale.
To prospective tournament participants, I would not recommend researching and forecasting via smartphone while barely conscious.
In my defense, I viewed my application as a bit of a longshot since I didn’t have any real forecasting experience and this was my first time ever applying for a tournament. Fortunately, my application was sufficient for my participation. I have no way of knowing how restrictive the admissions actually were. Maybe the application process was long enough to dissuade enough people that they needed to accept most of those who applied.
Tournament Format
The rules were originally a Google Doc distributed to participants which I’ve saved here as a PDF for posterity. I think the tournament ended up deviating mildly from them, especially in terms of timing, and they may have updated this document over the course of the event. Regardless, I’d expect the PDF I linked to be ~95% representative of how things worked. If you want the nitty gritty details, check it out, but otherwise I’ll summarize here. Please note that because this was an experiment, my experience may not be representative of the whole tournament. For all I know I was in a strange test condition or on an outlier team. I found out after the tournament that I was brought in as part of the “subject-matter expert” group of volunteers. I did not receive any training from the tournament on how to forecast well, but I think I remember there being some tips on how to effectively collaborate as a team.
The tournament contained a lot of questions for me to forecast: 59, to be exact. The first 12 were required for everyone and directly focused on GCRs. An example was…
What is the probability that one or more incidents involving nuclear weapons will be the cause of death, within a 5-year period, for more than 10% of humans alive at the beginning of that period…
These questions needed to be forecasted on multiple time scales: typically 2030, 2050, and 2100.
The other 47 were largely optional, with people being assigned a few required ones from the total pool and being encouraged to forecast as many of the rest as they could. These were less directly connected to GCRs, but still related. An example was…
What is the probability that total worldwide production of primary crops will fall by at least 70% within any three-year period…
The timescales for these questions were more variable. All of the questions had an associated Google Doc that typically shared some relevant sources and clarified any resolution criteria that participants found confusing.
Participants had roughly two weeks to independently forecast questions with both a probability estimate and a written rationale explaining the forecast being made. Incentives for the tournament were designed so that participants would strive to forecast accurately, and explain their forecasts in a way that others found convincing and resulted in them updating their own beliefs to become more accurate. See the rules document I linked for scoring details (it uses reciprocal scoring), which is an obviously critical detail in a tournament with prediction time scales going out to 2100 and something that may need to be figured out for GCR forecasting in general.
Participants were then able to chat with their team, see their forecasts, and leave comments on them. We had about three weeks to interact and update our own forecasts and associated rationale based on these interactions.
Two teams then merged (my team of “subject-matter experts” apparently merged with a team of “superforecasters”) and had over a week to review each other’s forecasts and update our own once again.
This combined team now had to collaborate on creating a single rationale in the form of a “wiki” style page for each question that synthesized all of the shared forecasts. We were assigned questions to take the lead on as an “editor” and given advice on how to structure the wiki, but everyone across both teams was able to contribute to them.
The final phase was about two weeks long and paired my combined team with a different combined team so that we could review each other’s “wiki” style synthesized rationales and update our own independent forecasts if we so desired.
(Note: The described phase lengths were how they were outlined in the rules, but in practice I think every phase was extended to give participants more time to complete it and the overall tournament took ~1.5 months longer than originally outlined)
Preparation
Once I got into the tournament, read the rules, and saw the questions I decided I was massively underprepared for what was involved and likely an underdog. Instead of diving right into forecasting, I wanted to take some time to get ready and develop a strategy.
I opened up my free time as much as I could and asked friends for the most relevant reading recommendations for my predicament. Both top recommendations had been on my radar for a while anyway, and I ended up ripping through them in about a day each. I found them to be tremendously useful:
Superforecasting: The Art and Science of Prediction by Philip E. Tetlock and Dan Gardner, 2015
Who better to learn about forecasting from than the leader of the team hosting the tournament? This book summarizes findings from the Good Judgement Project, the original forecasting tournament. It covers the biases that trip forecasters up and the attributes that set “superforecasters” (the top 2% of forecasters in terms of accuracy) apart from the rest.
I would call this required reading for anyone who plans to participate in a tournament. I think I’ve covered a lot of the key practical takeaways for forecasters between this post and Research: Prediction Polling, but I’d still recommend reading it on top of that.
I was surprised at how much of its contents I had already absorbed indirectly from the “Rationality” blogosphere (and reading The Sequences), but having it all tied together so succinctly and connected directly to the context of the underlying research was invaluable.
The Precipice: Existential Risk and the Future of Humanity by Toby Ord, 2020
More focused on the subject matter of the tournament, this is the most recent, most relevant book on GCRs (even though it’s primarily focused on existential risks) that I’m aware of. The recent publication date is especially relevant for rapidly evolving risks like AI or engineered pandemics. Toby Ord, the author, is also a founding voice in the EA movement, making this the definitive source on understanding the “Longtermist” perspective common in that world.
This book does a great job of surveying the risk landscape and discussing critical factors. It does a frustratingly poor job, from the perspective of someone trying to make forecasts, of discussing the methodology behind its probabilistic claims. It was also commonly cited in the background evidence for questions asked during the tournament, validating my claims about its relevancy.
To fill that gap, I would highly recommend the slightly more outdated Global Catastrophic Risks by Milan M. Ćirković and Nick Bostrom, which I had fortunately already read before applying. After the tournament I also discovered Global Catastrophes and Trends by Vaclav Smil, which is similarly excellent and was published around the same time.
The Tournament
Still feeling like an underdog, but at least having a handle on what we were doing, I thought about how to approach the task at hand.
The tournament took place on a custom website with a somewhat awkward UX. The first thing I did was copy all of the questions into my own Google Doc along with links to their supporting documents so I could work on them in one place. As I completed my rationales there I would copy and paste my forecasts to the tournament platform.
I decided to prioritize my required questions first before thinking about which optional questions I would take on, if any. My interpretation of the rules made me want to focus on quality over quantity and I didn’t have a good sense for how long each one would take me. Some of the required questions seemed related, and some even had clear dependencies between them. For example, the chance of nuclear extinction is a narrow case of the chance of nuclear catastrophe, and understanding the chance of nuclear extinction informs the overall risk of extinction. I laid out the required questions in a sequence that I believed would minimize the amount of times I had to go back and change rationales that I had already completed.
I think I spent an average of 1-2 hours per question and tried to stick to a schedule of answering at least a question a day because this would put me on pace to complete all of the required ones before the end of the independent phase. You can see my final forecasts with rationales in a separate post here. These are recorded from the very end of the tournament, but are pretty indicative of my general approach which stayed the same throughout the tournament. I encourage you to go check it out. This section of the Post-Mortem will probably make a lot more sense if you read through at least a couple of my rationales.
It’s difficult to articulate, but I basically tried to build a model for each question’s probability using the fewest assumptions, each of which I felt maximally confident in. I’d typically start with the “outside view” base rates if they were available. For example, how common are historic pandemics? Then I’d look to the “inside view” for trends or things that might change the base rate moving forward and try to incorporate them into my estimate. This is apparently forecasting 101 among people who regularly do this.
The output of each model was typically an annual chance that an event would occur, which I could then extrapolate out to the various time horizons specified. This allowed me to handle all three time horizons with a single model that I could put a lot more time into vs. developing three separate ones. Where I thought the annual rate would change over time, I accounted for that as well (engineered pandemics or AI), but typically there were competing uncertainties in the future (pandemic prevention tech vs. increased chance of zoonosis) that I felt comfortable roughly cancelling out to keep a constant annual chance.
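To make the arithmetic concrete, here’s a minimal sketch (Python, with invented numbers rather than my actual figures) of the extrapolation I mean: hold an annual probability constant and compound it out to each of the tournament’s horizons.

```python
def cumulative_probability(annual_p: float, years: int) -> float:
    """Chance of at least one occurrence over `years`, assuming a
    constant, independent annual probability."""
    return 1 - (1 - annual_p) ** years

# Hypothetical 0.1% annual chance of some catastrophe, compounded
# out to the tournament's three horizons (counting from 2022).
annual_p = 0.001
for target_year in (2030, 2050, 2100):
    print(target_year, round(cumulative_probability(annual_p, target_year - 2022), 4))
```

With those toy numbers it works out to roughly 0.8% by 2030, 2.8% by 2050, and 7.5% by 2100, all from a single annual estimate, which is exactly why I anchored my models there.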
In practice, this process only required some basic math that I would lay out in the rationale hoping that this would let teammates catch any errors in later stages. As I worked through more questions however I realized there were a lot of dependencies between my calculations. If I caught an error or found new evidence for nuclear risk, I’d have to go back and manually update my calculations in each affected question in sequence. After spending a painful amount of time on this in a couple different instances I realized I should just make a Google Sheet capturing my calculations and their dependencies and link it in my rationale. This was a huge boon to productivity and teammates later expressed appreciation for it.
In hindsight, I believe that my choices to focus on building models around annual rates and to use a spreadsheet to automatically capture dependencies between my models were the most important aspects of my tournament performance. You could imagine that if I was less comfortable with spreadsheets I might have had a lot less time to spend on forecasting because of all the manual updates, or I may have just abandoned the idea of interdependent models at all. I think the latter decision would have been a massive mistake, because it acted as a sort of automated enforcement of consistency across domains freeing me from having to worry about slipping into various biases.
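As a rough illustration of the dependency idea (a toy Python sketch with made-up numbers, not my actual spreadsheet), downstream estimates are computed from upstream ones, so a single correction propagates everywhere automatically:

```python
# Toy dependency chain: upstream assumptions feed downstream estimates,
# mirroring how linked spreadsheet cells behave. All numbers are invented.
def nuclear_estimates(annual_war: float,
                      catastrophe_given_war: float,
                      extinction_given_catastrophe: float) -> dict:
    annual_catastrophe = annual_war * catastrophe_given_war
    annual_extinction = annual_catastrophe * extinction_given_catastrophe
    return {
        "annual_war": annual_war,
        "annual_catastrophe": annual_catastrophe,
        "annual_extinction": annual_extinction,
    }

before = nuclear_estimates(0.01, 0.2, 0.001)
# New evidence revises the chance of war; everything downstream updates with it.
after = nuclear_estimates(0.005, 0.2, 0.001)
print(before)
print(after)
```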
Using this methodology I was able to get through all of my required questions, review them myself, and correct various errors before the first phase concluded. Obviously I wished I had more time, but I was pretty happy with the level of quality they were at. I knew I’d be building off of these rationales for the rest of the tournament, and I wanted to put in a lot of effort up front to give myself an edge.
Moving into the second phase, I got to see my teammates’ forecasts for the first time. I was excited to see what they had found that I’d missed and incorporate it into my forecasts. Instead, I went from thinking I was a massive underdog to believing I was well ahead of most of the pack. Some teammates clearly had relevant domain expertise that I did not, and there were some well explained insights and differences that I could draw from… but it also seemed clear that I had put in much, much more effort on all of my forecasts than the average teammate. I’m glad that we had so long to work on them in paranoid isolation, otherwise I probably would have regressed towards the mean just out of comfort.
In this phase I had fun asking questions on other forecasts and responding to questions on my own. We caught each other’s mathematical errors and questioned intuitions. One forecaster, who I thought was quite strong, tended towards a methodology very different from my own. He tended to build large models with lots of parameters that would then combine to produce his final answer. He would estimate each of these parameters, which gave him a lot of levers to adjust in the face of new evidence, and having so many parameters might even have allowed random errors to cancel out. As we talked about the intuitions that led each of us to believe our method was more likely to be accurate, he revealed that he had a professional background in nuclear reactor safety and models like this were typical in the field! This conversation was definitely a highlight of the tournament for me, something I don’t think could have happened in another context, and probably my favorite teammate interaction overall.
Team engagement seemed to fall off over the course of the tournament, with fewer comments being made and chat messages being sent. All of this took place on a custom platform with somewhat challenging UX, which I think contributed to the attrition. There were also no external notifications, so you had to check in to find if there were comments or messages for you. This sort of guaranteed that all communications would be asynchronous which made it difficult to get too deep into anything. The UX also seemed to contribute to some forecasting errors for teammates. Sometimes they couldn’t figure out how to update their forecasts at all, and sometimes they just weren’t logically consistent. Like forecasting a 5% chance of something happening by 2030 but only a 3% chance of the same thing happening by 2050. A lot of team communication was spent trying to help correct those errors (with only moderate success) because part of the scoring depended on team performance.
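Catching that particular class of error only takes a monotonicity check like the hypothetical sketch below (nothing like this existed on the platform): the cumulative probability of an event can never decrease as the horizon extends.

```python
def horizons_are_consistent(forecasts: dict[int, float]) -> bool:
    """Cumulative probabilities must be non-decreasing as the horizon extends."""
    years = sorted(forecasts)
    return all(forecasts[a] <= forecasts[b] for a, b in zip(years, years[1:]))

# The inconsistent example from above: 5% by 2030 but only 3% by 2050.
print(horizons_are_consistent({2030: 0.05, 2050: 0.03}))  # False
```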
Creating the combined team rationale wikis was challenging in the context of declining team engagement and I think the strong contributors ended up pulling a disproportionate amount of weight. This phase is also where the UX felt most challenging, with communication bottlenecks limiting how much collaboration could take place on a given document. I think the result was that most wikis were almost exclusively worked on by the editor, with minor comments sometimes left by others.
I don’t remember any significant updates to my own forecasts based on the other teams’ wikis, but I might have slightly adjusted some intuitions.
Results
Who knows! I’ll let you know in 2100!
But seriously, the research team was not even going to start scoring for accuracy until 2023 and I’m not sure when I’ll hear about my objective performance.
Personally, just based on thoroughness and effort, I felt like a top competitor. I also feel like I learned a ton by participating in this tournament and knowing what I know now would probably have done it even without a financial incentive. Seriously, it massively accelerated my understanding of both GCRs and forecasting.
If I understand the rules correctly (and they didn’t change), there were three subjectively granted financial awards per team, awarded by teammates via voting in a post-mortem survey. I won something for “best forecaster” and “best commenter”, which I think means I won two out of three possible awards on a team of 10+ competitors based on their votes. Thank you teammates. I feel like this at least partially validates my beliefs above. It also came with an invitation to join some sort of virtual conference with Philip Tetlock at an undisclosed later date to discuss future research directions which I’m quite excited about.
I’ll update this section whenever I get more information on my accuracy. That is the whole point of forecasting, after all…
Research Takeaways
First of all, to those who put in the effort to make this tournament happen, thank you! If you’re reading this, I hope you can tell by now how great it was for me. I think I shared a lot of the below feedback in the post-mortem survey, but some of it I’ve since figured out upon further reflection.
Disclaimer: I participated in the tournament from the perspective of trying to forecast things as accurately as possible, explain those forecasts as convincingly as possible, and to help my teammates to do the same. However, the tournament organizers were primarily interested in testing out new methods of forecasting GCRs! This mismatch of objectives likely meant that some conditions of the tournament were optimized for them to gain knowledge about it vs. my ability to be accurate. All of the following is written from my perspective of partial knowledge. I’ll keep up with research that the team publishes and try to update this page with more context as I have it.
Overall, I’m worried that this tournament format does not ultimately converge on the best GCR focused forecasts and rationales that could be produced. Participant interest fell off over the course of the event, individual forecasts in general did not feel sufficiently “deep” for me to have confidence in them, and the final wikis didn’t feel like meaningful syntheses of the collected evidence and perspectives. I’ve guessed what I think were a few contributors to this, and I think they all represent potential directions for improving future forecasting attempts and related research.
Reducing the Friction of Collaboration
The UX of the platform was a major issue from this perspective. It increased friction for all participants, constrained communication, and I expect it contributed to attrition. Some aspects were just confusing, but I’m sure questions to organizers made these clear already. In hindsight, I’m worried I might have messed up some of the research methodology by doing so much work outside of it and then just copy/pasting things into the site! I constantly found myself wanting to create a Discord server to collaborate with my teammates in. It would have been amazing to have a video call and get to introduce ourselves, or even just real time notifications when they said something to me.
I expect that reducing this friction effectively makes the same amount of participant effort “go farther” in terms of producing deeper forecasts.
Needed Deeper Research
I’m pretty confident that I dedicated well above average time and energy to this tournament… and even that fell off dramatically over the course of the event. Despite this, I left the tournament feeling like most of my forecasts were much too shallow and could have benefitted from more research! So what happened?
It felt like I was hitting rapidly diminishing returns from further research. It became harder to dig up relevant sources, and when I found them they were harder to parse, typically because they assumed a lot of pretty specialized knowledge. Continually investing in an area with diminishing returns doesn’t feel great, and I had a lot of forecasting questions to choose from. While I could have invested a lot more time in a small number of questions to meaningfully improve them, I wouldn’t have had time to meaningfully improve all of the ones I attempted. I think I targeted roughly the highest level of quality that I could uniformly apply.
I don’t think it had to work like this, though some of the time restriction is obviously intrinsic to the tournament environment. I wonder if there are alternate incentive schemes for forecasting and research that allow forecasts/rationales/research to be more public in a way that they can be continually built upon over time and connected to produce deeper and better supported reasoning. Alternatively, “subject matter experts” could have been used as dedicated researchers or reviewers focused on better informing the generalist forecasters vs. spending so much of their time forecasting themselves. While I’m sure this tournament structure will inform the field greatly about how well some things work, I’m very confident it’s not the right format for producing the best forecasts and rationales possible.
The collaborative wiki writing portion of the tournament took place when the participants were maximally burned out and was where the platform UX felt weakest to me. This is unfortunate, given these wikis were intended to be one of the primary outputs of the research. I’ve collaboratively written solid documents with other people before, but it usually involves repeated drafts with Google Docs comments and voice/video calls to discuss feedback. I think forecasting teams (see Samotsvety and Swift Centre) produce collaborative forecasts that are much, much more cohesive and useful than anything I saw in the tournament.
Reciprocal Scoring Doesn’t Seem Great
You can read more about the idea here, and I would guess that evaluating its practical use was one of the research objectives of this experimental tournament.
My intuition is that it confused a lot of participants in ways that reduced forecast accuracy. The theoretical rationale is that participants are supposed to assume that some other highly competent group is also attempting to forecast the truth, and that therefore the optimal strategy for them is to forecast the truth. I’m sure the math on this works out, but in practice I saw teammates with rationales like “Superforecasters will probably be underconfident so I’m adjusting my estimates downward” or “Subject matter experts are pessimists so I’m increasing the probability” that surely degraded forecast accuracy. My guess is that just being asked to forecast these separately made participants feel like they should give different answers. This was my own initial intuition, and I honestly mostly came to the “I’ll pretend everyone is striving to forecast accurately and converging on the true answer” approach as a time-saving method. I wasn’t familiar at all with the research basis for reciprocal scoring at the time.
This degradation in accuracy is similarly reflected in participant survey responses in the early research on reciprocal scoring that I discuss here. This plus my intuitions from participating myself make me think it’s intrinsic to the method, and this is an example of something that works very well theoretically with perfectly rational actors, but not so well in practice.
I actually don’t think this is a big deal, and my guess is that the tournament was designed in a way that will demonstrate it. Participants were asked to forecast their own beliefs alongside those of others for reciprocal scoring purposes, and I expect their own beliefs to be the most accurate. My intuition is that just asking people to be as accurate as possible and saying that a panel of the best superforecasters and subject matter experts will evaluate their forecasts accordingly, without giving them a formal scoring rule, will probably be an effective incentive scheme and greatly simplify things.
Moving Forward
Wow. This was fun. Just so cognitively stimulating and inherently worthwhile. I’m definitely looking for opportunities to work on similar things again. Thus this blog.
I feel like I barely scratched the surface of how forecasting relates to GCRs and there’s so much to learn. I definitely could have been more prepared for this tournament (instead of starting to prepare after it started), and I’ll be working in between focused events like this one to make sure I’m better prepared for the next one.
That being said, the time I allocated to strategy seemed well worth it and I think is primarily what differentiated me from most of the other competitors. I’ll try to continue this moving forward and share what I figure out here.
I have a lot more interest in the meta level of how to make forecasting efforts like this more accurate vs. how to become a highly accurate forecaster myself, but this “hands-on” experience was invaluable to that end as well. Overall this was an overwhelmingly positive experience for me, and I hope it was also useful from a research perspective. I’m really looking forward to reading whatever the researchers publish as an output of all their hard work.