In 2008, Nate Silver, a relatively unknown baseball statistician, correctly predicted every Senate race and all but one state in the presidential election. He accomplished this neither by physically reporting from the ground nor by using some esoteric technique of political science. Instead, he used basic statistics to analyze the large volume of publicly available polls and predict an outcome. The message was clear: data-based electoral predictions appeared to be significantly more accurate than predictions based on traditional political science.
Since then, data journalism has become increasingly popular and has made analysis of elections and other issues more accurate and quantitative. However, it is fraught with difficulties. Most journalists are not trained statisticians: they do not know how to accurately interpret the probabilistic nature of data, nor how to deal with models that reach seemingly contradictory conclusions. More importantly, journalism is not yet fully aware of the inherent limits of data-based reporting.
A Brief History of Data Journalism
Data journalism is the analysis of statistics to numerically justify stories and make predictions. To some degree, journalists have used data since it first became widely accessible; some of the earliest computer-analyzed, data-based stories come from Harvard's Nieman Foundation in the 1960s. The increased volume of polls that became publicly accessible with the growth of the Internet in the early 2000s made data journalism more mainstream. Among the first sites to focus on data journalism was RealClearPolitics (RCP), whose purpose was to collate polls alongside interesting political editorials so that the public could find both forms of political information on one site. RCP's first foray into true data analysis was the development of the RCP polling aggregate, which summarized all the publicly available polls by reporting their median. This simple statistical analysis took a first step toward eliminating possible polling biases and gave the public a better perception of the true state of the race. However, data-based predictions using these techniques in the 2004 presidential elections were not particularly successful compared to their non-data-based counterparts.
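A minimal sketch of what such a median-based aggregate might look like in code; the pollster names and numbers below are invented for illustration and are not drawn from RCP's actual data or method:

```python
# Hypothetical illustration of a simple polling aggregate that reports the
# median of available polls. All figures below are invented for illustration.
from statistics import median

# Each entry: (pollster, candidate A share, candidate B share)
polls = [
    ("Pollster 1", 48.0, 46.0),
    ("Pollster 2", 50.0, 45.0),
    ("Pollster 3", 47.0, 47.0),
    ("Pollster 4", 49.0, 44.0),
]

candidate_a = median(share_a for _, share_a, _ in polls)
candidate_b = median(share_b for _, _, share_b in polls)

print(f"Aggregate: A {candidate_a:.1f} - B {candidate_b:.1f}")
# The median is less sensitive than the mean to a single outlier poll,
# which is one way an aggregate can dampen individual pollsters' house effects.
```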
The 2008 elections saw the emergence of highly accurate data-based electoral predictions. Nate Silver's FiveThirtyEight was one of several sites that predicted the outcome of the election with much greater accuracy than traditional methods. Since then, the results of every midterm and presidential election have fallen within the error ranges of most data-based methods.
FiveThirtyEight and other sites, such as Vox.com, have attempted to apply data to forms of journalism beyond electoral and sports predictions. Common applications include testing conventional political wisdom, such as the claims that the opposition party gains seats during midterms or that a state's party affiliation fully determines its congressional races, and examining the effects of media-hyped events on a campaign. Another trend is the increasing use of journalist-created or crowd-sourced datasets, such as the project by the California Civic Data Coalition that aims to make campaign finance data in the state of California more accessible. More recently, data journalism has been extended by the relaunch of FiveThirtyEight at ESPN to analyze the benefits of college, nutritional guidelines, and restaurant rating systems. Nate Silver is pushing the limits of data journalism and appears to be outcompeting traditional journalism across all fields. Naturally, one asks: will the future of journalism be reporters staring at pollsters' statistics on endless Excel spreadsheets?
Data is a Double-Edged Sword
As powerful a tool as data is, it is also easy to misuse. Journalists are often not statisticians, and the few who are publish their predictions and analyses on sites that aren't mainstream. In an interview with the HPR, journalist Sasha Issenberg noted that "journalists are largely unsuited for [data-based] analysis." Many prominent media outlets, such as the New York Times, unintentionally misreport data predictions when relaying them to the general public. For example, this article falsely asserts that Nate Silver has "already decided the election." In reality, Silver stated that the Republicans were favored to win, not that they would win. Even worse, other sites and analysts intentionally misrepresent data in order to confuse the public; this site reassured readers of a Romney victory by intentionally inserting an arbitrary Republican bias into the polls. As a result, data is a double-edged sword: it can improve the public's awareness of the world, but it can also dramatically mislead.
The first and most common pitfall is that data is inherently probabilistic. Predictions are not reported as certainties or facts; they have associated probabilities, which in more advanced analyses carry their own error terms. These probabilities arise from a variety of sources: the sample sizes and random error of polls, polling biases, potential flaws in the method, and so on. In an interview with the HPR, RealClearPolitics editor Sean Trende said that "[in order] to get published in a peer-reviewed journal, you generally have to show a probability of 95 percent [as that is] our threshold for [calling a statement] knowledge." Any prediction made with less than 95 percent probability cannot accurately be quoted as a truth. Electoral predictions give an estimate of the likelihood of Republicans or Democrats winning a seat; they do not see the future.
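To make just one of those error sources concrete, here is a rough sketch, with invented numbers, of the sampling margin of error for a single poll, using the standard 95 percent normal approximation of 1.96 times the square root of p(1-p)/n; real polls also carry the other errors described above:

```python
# Rough sketch of the 95 percent sampling margin of error for a single poll,
# using the standard normal approximation. Numbers are invented for illustration.
import math

n = 800    # sample size (hypothetical)
p = 0.52   # reported share for the leading candidate (hypothetical)

moe = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"Margin of error: +/- {moe * 100:.1f} points")
# A 52 percent result with a roughly +/- 3.5 point sampling margin of error is
# consistent with anything from a comfortable lead to a dead heat, which is
# why a single poll cannot be quoted as a certainty.
```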
The probabilistic nature of data is not respected by many publications. Predictive headlines such as "Republicans will win the midterms according to Nate Silver" appear before and after every election, and yet they misstate what the models actually claim. Data-based predictions are not necessarily correct if the side they favor actually wins, nor are they necessarily incorrect if the side they favor loses. There are two common interpretations of what constitutes accuracy in a data-based prediction. The first is calibration: over a long period of time, candidates favored to win around 70 percent of the time should actually win about 70 percent of the time. Nate Silver checked his 2014 NCAA tournament predictions in this manner. The second is the Brier score, which is described here and then used to evaluate predictions of the 2012 presidential elections.
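A minimal sketch of both evaluation approaches, using hypothetical forecasts and outcomes rather than any real FiveThirtyEight data: a calibration check (do events given roughly a 70 percent chance happen about 70 percent of the time?) and the Brier score, the mean squared difference between forecast probability and outcome, where lower is better:

```python
# Sketch of two ways to evaluate probabilistic forecasts.
# Forecasts and outcomes below are invented for illustration.

# Each entry: (forecast probability that the favorite wins, 1 if they won else 0)
forecasts = [
    (0.90, 1), (0.75, 1), (0.70, 0), (0.65, 1),
    (0.80, 1), (0.55, 0), (0.95, 1), (0.60, 1),
]

# Brier score: mean squared error between forecast probabilities and outcomes.
brier = sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)
print(f"Brier score: {brier:.3f}")  # 0.0 is perfect; 0.25 is no better than coin flips

# Calibration check: among forecasts in the 60-80 percent bucket,
# how often did the favorite actually win?
bucket = [outcome for p, outcome in forecasts if 0.6 <= p <= 0.8]
print(f"Favorites given a 60-80% chance won {sum(bucket) / len(bucket):.0%} of the time")
```

Under either measure, a forecaster who gives the eventual loser a 30 percent chance is not "wrong"; the question is whether such 30 percent events happen about three times in ten over many forecasts.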
This misunderstanding has profound effects on the quality of electoral reporting. For example, prior to the midterms, most of the public believed that the Republicans were guaranteed to win the Senate based on the data. As a result, conservative news anchors tried to defend FiveThirtyEight's predictions, while prominent liberals stated publicly that data-based analyses do not adequately account for potential biases in the polls. However, FiveThirtyEight already accounts for potential polling bias when it calculates winning probabilities; this is why its margin of error is significantly larger than that of any individual poll. This is not an isolated phenomenon: it has occurred around every election in which Nate Silver has made a prediction, and it significantly degrades the quality of reporting on those predictions.
The second, more subtle pitfall is that while data is objective, any analysis of data must be subjective. The increasing volume of data available has highlighted a significant problem for data journalists: it is possible to find data saying almost anything. Data analysis must be performed to determine what the "truth" is or to make predictions, but every such analysis has assumptions built into its model. In an interview with the HPR, pollster and columnist Kristin Soltis Anderson said, "[data analysis is like] saying you're baking a cake […] and ultimately, the final ingredient you're adding are your own biases and assumptions." Some sites, such as FiveThirtyEight, justify their assumptions with scientific analysis and then provide readers with a detailed description of those assumptions. But even the best models rest on assumptions, and a lack of awareness of these assumptions and their implications has profound effects on the quality of data journalism.
A major consequence is that many organizations only look at data that supports their own views or perform data analyses with assumptions that lead to favorable results. In the 2012 elections, the conservative newspaper Washington Examiner published a data-based prediction suggesting that Romney would easily beat Obama. Anderson agreed that many Republican outlets in the 2012 elections tended to constantly cite the Gallup polls, which had a known Republican bias. This even affects campaigns: when independent polls produce wildly differing results, internal polls tend to be significantly biased toward the campaign's own candidate.
In order to ensure more accurate reporting, election modelers and pundits first need to be more proactive about making their assumptions publicly available and easily understandable, much as FiveThirtyEight does. Further, Issenberg commented, "[We shouldn't be] debating whether somebody's model that gives Republicans a 58 [percent] chance is really accurate or not." Since it is impossible to evaluate models accurately until after the election, excessive media focus on the differences between models is unnecessary; instead, the media should increase awareness about the campaigns and their platforms. At best, journalists should synthesize all the scientifically sound models and report their results with as little bias as possible; this ensures that the public receives an accurate perspective on what the election is likely to look like.
Scientifically accurate data journalism that satisfies all of the above criteria is hard to find, given the small number of trained data journalists and the large number of journalists merely reporting about data. Before the full power of quantitative data journalism can be opened to the public, the field needs to become more accurate and less misleading.
How Far Can Data Go?
When done correctly, data journalism can seem incredibly powerful. Probabilistic predictions of virtually anything are possible with sufficiently sophisticated models and complete information. One of the most powerful applications of data journalism is evaluating the effect of a particular action, hypothetical or real, on public opinion or on the state of a campaign. For example, FiveThirtyEight analyzed the effects of various events in September 2012, such as Romney's "47 percent" gaffe, on the state of the campaign, and found that the media hype around major gaffes had little effect on the polling numbers. This type of analysis, when applied to areas like public policy or economics, helps journalists debunk "common knowledge" theories and replace them with quantitative statements that accurately predict future outcomes.
It is virtually impossible to compete with data by traditional means; in the time it takes a traditional journalist to identify a potential trend by interviewing five people, a data journalist can analyze statistics on five million people and make generalizations about the whole country. In an interview with the HPR, Issenberg postulated, "In 2020, anybody who is doing any good work in journalism [will be] doing a version of what in 2014 seemed like [data journalism]." Joshua Benton, director of the Nieman Journalism Lab, agreed, saying that in the near future journalists should at the very least "be familiar with a basic idea of what a dataset is [and] where it is." The increasing availability of data makes some form of data analysis a requirement for almost any article.
But even the most powerful and accurate data journalist is limited. Data can only analyze or confirm trends that are observed on the ground. "You absolutely have to tell human stories," says Anderson. Data cannot identify movements of people, nor can it explain how people think or why they think the way they do. It can only generate hypotheses about how people will behave and build models that test whether those hypotheses are true. For example, FiveThirtyEight's coverage of the recent Ebola epidemic consists of six articles. Four of these are simple fact-checks of claims made by the popular media regarding the virulence of the epidemic. These articles are important because they help combat the inaccurate use of data by CNN and other news agencies, but they do not provide a new angle on the story. The fifth provides a quantitative perspective on whether an Ebola flight ban could stop the epidemic. It helps readers understand why the flight ban policy is destined to fail, but not why such a ban would be immoral.
In contrast, this non-data piece by the Washington Post has a similar thesis but provides sharply different justification, describing why the world has a moral obligation not to ban flights to and from West Africa so that humanitarian aid can continue to flow. Taken together, the two pieces give readers a full perspective: an understanding both of why flight bans are ineffective and of why they would be immoral even if they were effective. Neither piece alone can provide this understanding.
The overuse and misuse of data today are not limited to journalism. Society is approaching a cult of scientism, in which all aspects of our lives are distilled into numbers and decisions are made on the basis of those calculations. This approach is incredibly powerful, as numbers are better informed and more objective than a person's feelings can be. However, people must be aware of the limitations of data: any model inherently introduces its own assumptions, which must be tested, and no model can capture certain aspects of an issue. By 2020, data will be a ubiquitous part of our lives, just as it will be a ubiquitous part of journalism. It remains to be seen whether it will be a boon or a hindrance.