Diving Deeper

Don’t Blame Polls: Know the Data Science to Analyze Results

Understanding polling and election modeling: How the data science and models behind forecasts that seemed off-target in 2016 are different in 2020.

With less than a week until Election Day, key polls are pointing in the Democrats’ direction. Former Vice President Joe Biden leads in nearly every key battleground state, and pollsters and pundits say his victory is likely.

For many Democrats, however, it feels like familiar territory, a PTSD of sorts from 2016, when pundits described the race as “Hillary’s to lose” and assigned high chances to her victory.

But by late Election Night, Donald Trump had claimed the nation’s top elected office. Overnight, pollsters found themselves in the crosshairs of people blaming them for a surprise upset.

After such a discrepancy between expectations and results, can Americans trust today’s polls? Layer in pandemic complications, including a record-setting pace of absentee and early voting, and the potential for surprise only rises.

Unpacking what happened in 2016 begins with understanding the difference between polls and analytics. Polls give a snapshot of what a group of voters say they will do at a specific moment. Polls don’t predict what will happen next week or next year, and they can’t predict the results of a presidential election.

When used properly, polls can be a powerful, reputable input for forecasting, which is where analytics come into play. Analytics uses raw polling data to predict the future. This practice of building complex statistical models to forecast elections is best known through groups like FiveThirtyEight and the Princeton Election Consortium. Too often, political punditry about poll results masquerades as analysis and is misconstrued as a statistical prediction of the future.
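For illustration, here is a minimal sketch of what a poll-based forecast might look like: simulate many elections by adding random polling error to each state’s polling average and count how often one side clears a majority of electoral votes. The states, leads, and error figures below are invented for the example, and real models such as FiveThirtyEight’s are far more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical polling averages: candidate A's lead in percentage points
# and electoral votes for three illustrative states (numbers are made up).
states = {
    "State1": {"lead": 2.0, "ev": 20},
    "State2": {"lead": -1.0, "ev": 16},
    "State3": {"lead": 4.0, "ev": 29},
}
POLL_ERROR_SD = 3.0   # assumed polling error (standard deviation, in points)
N_SIMS = 10_000
total_ev = sum(s["ev"] for s in states.values())

wins = 0
for _ in range(N_SIMS):
    ev = 0
    # Each state's error is drawn independently to keep the sketch short;
    # real forecasts also model error that is correlated across states.
    for s in states.values():
        simulated_lead = s["lead"] + rng.normal(0, POLL_ERROR_SD)
        if simulated_lead > 0:
            ev += s["ev"]
    if ev > total_ev / 2:
        wins += 1

print(f"Candidate A wins in {wins / N_SIMS:.0%} of simulations")
```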

This distinction is critical to understanding what went wrong in 2016 and offers insight for avoiding repeat issues today. The biggest myth of the 2016 election was that the polls told a far different story than the outcome. In fact, post-election analysis revealed polls were as accurate as they’d been since 1968. The data showed a close and uncertain race, full of volatility, and a possible Trump victory, all within the margin of error.
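For context on what “within the margin of error” means: the textbook sampling margin of error for a poll is roughly ±1.96·√(p(1−p)/n) at 95% confidence. The snippet below is a rough sketch with invented numbers; real polls also carry design and nonresponse error that this simple formula ignores.

```python
import math

# 95% margin of error for a simple random sample (textbook approximation only).
def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    return z * math.sqrt(p * (1 - p) / n)

# A hypothetical poll of 1,000 voters showing a candidate at 48%:
print(f"{margin_of_error(0.48, 1000) * 100:.1f} points")  # ~3.1 points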

So, what happened? Many news analysts failed by misinterpreting polling data or putting their own, or their party’s, spin on it. Whether fueled by wishful thinking or data illiteracy, that misinterpretation of the polls became the primary driver of Democrats’ disappointment.

Faring as poorly were most complex statistical models, which combine polling data with assumptions about how the world ought to function. Except for FiveThirtyEight, which was criticized at the time for giving Trump a significant chance at victory, most models were built on faulty assumptions and produced predictions of a near-sure Clinton victory.

A common theme was that model builders set homestretch error assumptions too low. In the case of the Princeton Election Consortium model (whose creator Sam Wang famously ate a cricket on CNN after saying he would eat a bug if Trump got more than 240 electoral votes), this parameter was bullishly set below 1%, meaning these models touted greater than 90% probabilities of a Clinton victory. The media echoed that as a sure thing. Models that maintained assumptions more consistent with historical levels of polling error outperformed those that didn’t.
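To see why that one assumption matters so much, consider a stripped-down sketch (not the Princeton Election Consortium’s actual code or parameterization): if the final margin is modeled as normally distributed around the polling lead, the win probability depends heavily on how much error you assume. The leads and error levels below are illustrative.

```python
from statistics import NormalDist

def win_probability(poll_lead: float, assumed_error_sd: float) -> float:
    """Probability the final margin is positive, if it is Normal(lead, sd)."""
    return 1 - NormalDist(mu=poll_lead, sigma=assumed_error_sd).cdf(0.0)

# The same hypothetical 3-point lead under different error assumptions:
print(f"{win_probability(3.0, 1.0):.1%}")  # sd = 1 point  -> ~99.9%
print(f"{win_probability(3.0, 4.0):.1%}")  # sd = 4 points -> ~77%
```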

Another critical breaking point related to the unprecedented number of undecided voters. Many leading models assumed undecided and minor-party voters wouldn’t fall strongly in either direction. However, they overwhelmingly voted for Trump.

Four years later, both candidates are polling at higher rates than their respective 2016 counterparts. While it might sound counterintuitive, that is simply the result of a much smaller share of the population remaining undecided on how they will vote: current figures are around 4%, compared with the 16% of those polled at this same time four years ago. This helps enable more reliable interpretations.
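A quick back-of-the-envelope illustration, using the figures above, of why the smaller undecided pool matters: even in the extreme case where every undecided voter breaks the same way, 4% undecided can move the final margin only a quarter as far as 16% could.

```python
# Rough upper bound on how much undecided voters can move the final margin:
# in the extreme case, every undecided voter breaks toward the same candidate,
# shifting the margin by the full undecided share (illustrative figures only).
def max_margin_shift(undecided_share: float) -> float:
    return undecided_share

print(f"~16% undecided (2016): margin can move up to {max_margin_shift(0.16):.0%}")
print(f" ~4% undecided (2020): margin can move up to {max_margin_shift(0.04):.0%}")
```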

In reality, only one prediction proved valid in 2016: the outcome was highly uncertain, given the number of swing states in play and poll results falling within the margin of error, even if most showed a Clinton edge. Neither news outlets nor pundits put enough emphasis on the critical topic of statistical uncertainty, instead simplifying it to a predicted Clinton win.

Unfortunately, one final culprit of 2016’s surprise was general data illiteracy. While polling data portrayed a tight race, many people relied on pundits to interpret those numbers. Polls represent the proportion of votes a candidate might get, while forecasting looks at the probability of a given outcome. People commonly mistook percentage points for votes. For example, if a model assigned Trump a 30% chance of victory, many Americans wrongly assumed he would receive 30% of electoral votes.
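One way to see the difference: a win probability describes repeated hypothetical elections, not a share of the vote. The sketch below, using an assumed 30% win probability, simply counts how often such a candidate wins across simulated elections; in each of those wins the candidate could take far more than 30% of electoral votes.

```python
import random

random.seed(0)

WIN_PROBABILITY = 0.30  # assumed forecast probability, not a vote share
N_SIMS = 10_000

# Count how many simulated elections the candidate wins outright.
wins = sum(random.random() < WIN_PROBABILITY for _ in range(N_SIMS))
print(f"Candidate wins {wins} of {N_SIMS} simulated elections (~{wins / N_SIMS:.0%})")
```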

However well-grounded forecasts may be at predicting an outcome, they can’t factor in bombshells that might occur in the weeks leading up to the election. October surprises are part of presidential campaign lore. Four years ago, Clinton faced then-FBI Director James Comey’s letter to Congress about her emails. Today, Trump is recovering from his recent COVID-19 illness and contending with opinions formed after the first presidential debate.

And while redesigned models can adapt to unknowns, we don’t have insight into how votes will be cast and counted in such an abnormal election cycle. Americans should rightly be suspicious of any poll analysis that says the race is decided. Yet another bombshell could be right around the corner, and remember: polling data is only good at that single moment in time.