Civic Hacking and Journalism: How data is changing civic engagement

A few thoughts about data journalism from Mozfest 2015

A few days ago I attended my very first Mozfest. Now I understand why survival guides have been written for this event. It’s so huge and busy you can easily get lost and overwhelmed! Just as I did with the Daten Labor conference a few weeks ago, I want to write down some interesting observations here.

The importance of programming languages and computational thinking in data journalism

I talked with a lot of coding journalists and technologists working in journalism. What I found striking was the number of self-taught programmers. For many, the journey into data journalism began with a personal interest in programming. This got me thinking: when we talk about data journalism, we talk a lot about, well, data. But I never read about the importance of high-level programming languages like Python or Ruby, whose frameworks and libraries make programming relatively easy to learn and fast to work with. It seems hard to imagine data and programming playing such a big role in journalism today without them. It also became apparent that Stack Overflow contributes a lot to this development. Out of curiosity I asked several coding journalists: what would happen if Stack Overflow was suddenly gone? The answers ranged from “my productivity would drop considerably” to “I don’t want to think about it”. It seems that the ubiquitous availability of data alone would not have been enough to make data journalism such a big deal. Only in combination with developments that made programming relatively easy to learn and compatible with journalistic workflows could data journalism become more widespread.
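To make the point about high-level languages concrete, here is a minimal sketch of my own (not from any of the talks) of the kind of task that takes only a few lines of standard-library Python; the file name and column are hypothetical placeholders:

```python
# Count the most frequent values in one column of a CSV file.
# "budget.csv" and the "department" column are hypothetical examples.
import csv
from collections import Counter

with open("budget.csv", newline="", encoding="utf-8") as f:
    counts = Counter(row["department"] for row in csv.DictReader(f))

# Print the five most common departments and how often they appear.
for department, n in counts.most_common(5):
    print(department, n)
```

The same task in a lower-level language would mean parsing the file and managing data structures by hand; this is the kind of affordance that makes coding compatible with journalistic deadlines.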

In relation to the role of modern programming languages, something that occurred to me both at Mozfest and the Daten Labor conference was the importance of computational thinking for data journalism. Computational thinking means “formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent” (Wing 2010). It involves abstraction, modularization, “and automation via algorithms to enable scale” (Diakopoulos 2016). At the Daten Labor conference, a journalist told me that the most significant change in her newsroom was not so much working with data but the automation of repetitive tasks. At Mozfest, Phillip Smith, who has experience with teaching journalists how to code, told me that the first step for many aspiring data journalists is to think about which parts of their work could be automated, to get a grasp of how coding is useful for them. Without the ability to think computationally, it is difficult for journalists to integrate working with data into their workflows. Some researchers in Journalism Studies like to distinguish between data journalism and ‘computational journalism’ (e.g. Coddington 2015), but I don’t think this distinction holds in practice – computational thinking easily leads to working with data, while data journalism without computational thinking seems very limited. Maybe we should talk less about a ‘data revolution’ and more about a revolution in programming affordances and computational thought?
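As a hedged sketch of what ‘automating the repetitive part’ might look like in practice: looping over a folder of monthly exports instead of opening each one by hand. The directory layout and the "total" column are hypothetical examples of mine:

```python
# Sum one column in every monthly CSV export, instead of opening
# each file by hand. "exports/" and the "total" column are
# hypothetical examples.
import csv
from pathlib import Path

for path in sorted(Path("exports").glob("*.csv")):
    with path.open(newline="", encoding="utf-8") as f:
        total = sum(float(row["total"]) for row in csv.DictReader(f))
    print(f"{path.name}: {total:,.2f}")
```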

Transparency in data journalism: A luxury?

As at the Daten Labor conference, the transparency and reproducibility of data journalism and science were an important theme. There was a session about Jupyter notebooks which was attended by journalists, technologists, and scientists. There was also one about the difficulties and pitfalls of working with data in journalism. For scientists, transparency and reproducibility is (or should be!) a natural concern to ensure the validity of the research. Journalists want to ensure that their claims are true as well, but their relationship to transparency and reproducibility seems to be a bit less straightforward. There was an interesting comment by Sisi Wei of ProPublica at a panel discussion: the number one fear of every journalist is having to correct something; number two is getting scooped. The approach of ProPublica is to be very closed before publication and as open as possible afterwards. But to cope with fear number one, a lot of work has to be invested to make sure there are no errors. Moreover, the data and the code need to be published with sufficient documentation to prevent misuse and to ensure others can utilize it. I think there is a danger here that transparency ends up only being practiced by larger newsrooms with more resources, while smaller ones ‘play it safe’ due to fear number one. However, Sisi also suggested that the openness of larger investigative newsrooms like ProPublica allows local journalists to use the available data and tools to create stories for their own areas. This could help to mitigate the issue. Another help could come from hyperlocal websites and civic tech applications provided by NGOs such as mySociety. More routines and established guidelines when it comes to data reporting might also help. Still, it seems we are far from having transparency and reproducibility as the norm in data journalism.

References

Coddington, Mark. 2015. “Clarifying Journalism’s Quantitative Turn.” Digital Journalism 3 (3): 331–48. doi:10.1080/21670811.2014.976400.

Diakopoulos, Nicholas. 2016. “Computational Journalism and the Emergence of News Platforms.” In The Routledge Companion to Digital Journalism Studies, edited by Scott Eldridge II and Bob Franklin. http://www.nickdiakopoulos.com/wp-content/uploads/2011/07/Computational-Journalism-and-the-Emergence-of-News-Platforms.pdf.

Wing, Jeannette M. 2010. “Computational Thinking: What and Why?” Pittsburgh. https://www.cs.cmu.edu/~CompThink/papers/TheLinkWing.pdf.

Daten Labor 2015: Data journalism and science

Last week I visited the Daten Labor conference in Germany, which brought together data journalists and scientists and offered workshops. It was a great opportunity to meet both established and aspiring data journalists and to get an impression of the ‘state of the art’ in Germany. It’s impossible to do justice to this conference in just one blog post, but I’ll try to collect some interesting observations here.

1. Data journalists just want to be journalists

A discussion panel about how data journalism and science could profit from each other showed that scientists and data journalists have different views on what data journalism is or should be about. Some of the scientists on the panel were hoping that data journalists could help to explain the meaning and the consequences of big data technologies to the public and thereby hold Google accountable (I mention Google here because Google – and not governments – was the example used in the discussion). In other words: data journalists should be involved in what Nicholas Diakopoulos has called algorithmic accountability, the investigation of algorithms used by companies like Google or by governments. This implies that data journalists and scientists should work closely together and that data journalists’ primary task is communicating the insights to the public. Recognizing that journalists and scientists work on different time scales and have different constraints, scientists should provide journalists with tools that help them to profit from scientific insights during their investigations.

Data journalists were skeptical about these ideas and engaged in some boundary work. They don’t see themselves as ‘Algorithmists’, the term Mayer-Schönberger and Cukier (2013, 180–82) use to describe those who should review algorithms to prevent abuse. They seem to see data journalism primarily as a continuation of traditional journalistic work practices, i.e. as a set of new techniques to find and tell stories, not as a new way to mediate between scientific research and the public. Moreover, the data most journalists work with is descriptive and relatively small in size, and the analysis performed on it is described as rather simple (finding minima, maxima, outliers and so forth). A deep insight into statistics or the inner workings of algorithms would not be necessary for the vast majority of data journalism performed today. If special knowledge is required, it is common practice to ask external experts.
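To give an idea of the level of analysis described here, a hedged sketch using pandas (one common choice; the file and column names are my hypothetical examples):

```python
# Descriptive checks of the kind described above: minima, maxima,
# and a crude outlier pass. "spending.csv" and the "amount" column
# are hypothetical examples.
import pandas as pd

df = pd.read_csv("spending.csv")
print(df["amount"].min(), df["amount"].max())

# Flag rows more than three standard deviations from the mean -
# a rough but common first pass at spotting outliers.
mean, std = df["amount"].mean(), df["amount"].std()
print(df[(df["amount"] - mean).abs() > 3 * std])
```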

Nevertheless, it became clear that scientists and journalists could cooperate for mutual benefit – journalists could profit from the insights of researchers by using their tools, while scientists could test their theories and algorithms with the data and real use cases provided by journalists.

2. Transparency and reproducibility in data journalism

An important theme for both scientists and journalists was the transparency and reproducibility of how data journalists gain their insights and produce their stories. Considering that researchers have been struggling to ensure the transparency and reproducibility of their own work for centuries without finding a perfect solution, journalists and scientists should cooperate on this issue to find new ways suitable for today’s technological affordances. While all agreed that using open source tools and sharing the code behind the story is a good idea (at least in theory, see the next point), how can we make sure that others can really recreate and evaluate the development process?

One possible solution discussed at the conference was Jupyter notebooks. Jupyter notebooks combine the writing of text with the writing and execution of software code. This means that you can start a notebook with a long text-only introduction followed by a code snippet together with its output, followed again by some discussion of the code and so forth – you can check this example to see what it looks like. Jupyter notebooks are intended to be used in an iterative and explorative way and make every step taken by the writer/developer transparent. I think it was quite remarkable that Fernando Perez, the creator of IPython and now lead developer of the Jupyter project, gave a keynote about his work at the conference. He showed an interesting example of how this technology has been used in journalism: Brian Keegan’s extensive critique of FiveThirtyEight, which used a Jupyter notebook to recreate and question the work of journalists. There were a few journalists at the conference who already use Jupyter notebooks for their own work. For them, the value of these notebooks was not only transparency, but increased productivity: being able to easily recreate the steps taken on a dataset a few weeks ago saves a lot of time and makes collaborations easier.
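As a rough illustration of how a notebook interleaves narrative and code, here is a sketch of three cells (markdown cells rendered as comments; the dataset and columns are hypothetical inventions of mine):

```python
# --- Markdown cell: "We start from the raw spending data as published
# --- and load it unchanged, so every later step stays visible."
import pandas as pd
df = pd.read_csv("city_spending_raw.csv")  # hypothetical dataset

# --- Markdown cell: "Duplicated invoice rows inflated earlier totals;
# --- we drop them here and record how many were removed."
before = len(df)
df = df.drop_duplicates(subset="invoice_id")
print(f"removed {before - len(df)} duplicate rows")

# --- Markdown cell: "The headline figure for the story:"
print(df["amount"].sum())
```

Because every transformation sits next to its output and its justification, a reader (or a colleague weeks later) can rerun the whole chain from top to bottom.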

Will Jupyter notebooks be a big thing in data journalism? I have some doubts about that…

3. Data journalism is full of experimentation and uncertainties

It became clear that data journalism is far from being an established form of news reporting in Germany, both in terms of organizational structures and in the way it is practiced. This already starts with defining what data journalism actually is – it is telling that several keynotes and workshops started by providing a definition. Organizationally, the way data journalism is integrated into newsrooms varies quite a bit and newspapers are still experimenting a lot. In some cases, there is a distinct team that only does time-consuming investigative stories, while in other cases data journalism is more integrated into daily news reporting. In many cases, however, it seems that there is no clear structure and data journalism projects are organized ad hoc.

When it comes to how data journalism is practiced by individual data journalists, I tweeted this:

To be clear, it’s not that I think data journalists don’t know what they are doing. What I mean is the way they talk about how they work and what level of expertise they have. There seems to be a lot of insecurity and experimenting involved in ‘doing’ data journalism. One expression of that is the reliance on finding information on Google. Julius Tröger, who works at the Berliner Morgenpost and gave a workshop on mapping data, gave the ultimate advice for dealing with ‘geeky stuff’: search for the name of the tool you want to use on Google and add ‘for journalists’. Some journalists told me that they constantly search on Google or Stack Overflow while they’re coding to slowly tackle the issues they are concerned with. It also seems common to build up a set of reusable scripts over time. Probably as a result of that, journalists tended to describe their code as ‘lousy’ or ‘dirty’. It seems that the average data journalist is far from being able to examine and critique the algorithms behind big data, something scientists were hoping for. Moreover, this insecurity about their own coding skills made some journalists hesitant to share their code online, which implies a bit of a conflict between theory and practice: while all journalists agreed on the importance of transparency and reproducibility, I’m not sure how many of them would be willing to share their research process as a Jupyter notebook (or in some other form).
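The reusable scripts mentioned above are typically small helpers carried from project to project. A hypothetical example of the genre (my own sketch, not code shown at the conference):

```python
# A tiny helper for cleaning messy number columns - the kind of
# script a journalist might reuse across projects. Entirely
# hypothetical code.
def clean_number(value: str) -> float:
    """Turn strings like '1.234,56 €' or '1,234.56' into a float."""
    s = value.strip().replace("€", "").replace(" ", "")
    if "," in s and "." in s:
        # Assume the last separator is the decimal mark.
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif "," in s:
        s = s.replace(",", ".")
    return float(s)

print(clean_number("1.234,56 €"))  # 1234.56
print(clean_number("1,234.56"))    # 1234.56
```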

To summarize…

To sum up, what I take from this conference is:

  1. For journalists, data journalism is not a revolutionary new form of journalism, but a set of techniques to extend the traditional journalistic work of finding and telling stories.
  2. The lack of transparency and reproducibility is recognized as an issue; it will be interesting to see what kinds of solutions become more widespread in the future.
  3. Data journalism in Germany is still in its infancy and is marked by experimentation and insecurity.

As I mentioned above, these are only a few observations. If you understand German, make sure to check the conference documentation provided by the organizers.

References

Mayer-Schönberger, Viktor, and Kenneth Cukier. 2013. Big Data: A Revolution That Will Transform How We Live, Work and Think. London: John Murray.