Data journalism and film dialogue


Hannah Anderson and Matt Daniels, "Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age", A Polygraph Joint 2016:

Lately, Hollywood has been taking so much shit for rampant sexism and racism. The prevailing theme: white men dominate movie roles.

But it’s all rhetoric and no data, which gets us nowhere in terms of having an informed discussion. How many movies are actually about men? What changes by genre, era, or box-office revenue? What circumstances generate more diversity?

To begin answering these questions, we Googled our way to 8,000 screenplays and matched each character’s lines to an actor. From there, we compiled the number of lines for male and female characters across roughly 2,000 films, arguably the largest undertaking of script analysis, ever.
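The pipeline they describe (parse each script, attribute lines to characters, join characters to actor metadata, then tally by gender) can be sketched in a few lines of Python. Everything below is an invented stand-in for whatever format their scraped data actually takes; the character-to-actor table in particular is the hand-matching step they allude to:

```python
from collections import Counter, defaultdict

# Hypothetical minimal records: (film, character, number_of_lines) from parsed
# scripts, plus a (film, character) -> actor-metadata lookup built by matching.
script_lines = [
    ("Film A", "RIPLEY", 220), ("Film A", "DALLAS", 180),
    ("Film B", "INDY", 300), ("Film B", "MARION", 120),
]
casting = {
    ("Film A", "RIPLEY"): {"actor": "Sigourney Weaver", "gender": "F"},
    ("Film A", "DALLAS"): {"actor": "Tom Skerritt", "gender": "M"},
    ("Film B", "INDY"): {"actor": "Harrison Ford", "gender": "M"},
    ("Film B", "MARION"): {"actor": "Karen Allen", "gender": "F"},
}

def gender_share(script_lines, casting):
    """Tally dialogue lines by gender within each film, as proportions."""
    totals = defaultdict(Counter)
    for film, character, n in script_lines:
        meta = casting.get((film, character))
        if meta is None:
            continue  # unmatched character: any real pipeline must decide what to drop
        totals[film][meta["gender"]] += n
    return {film: {g: c / sum(cnt.values()) for g, c in cnt.items()}
            for film, cnt in totals.items()}

shares = gender_share(script_lines, casting)
print(shares["Film A"])  # {'F': 0.55, 'M': 0.45}
```

The interesting engineering is all hidden in the `casting` table: matching thousands of screenplay character names to cast lists is presumably where most of their effort went.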

This is a fine example of modern Data Science, showing how to do it outside of academe and big-company research labs.

Matt Daniels sees this work as a kind of journalism ("The Journalist-Engineer", Medium 10/25/2015):

Lately, some of the best articles in the NY Times and Bloomberg are 99% code. The end-product is predominantly software, not prose.

He asserts that the new things he likes are

raw numbers without any abstractions. There’s no attachment to the news cycle. There’s no traditional thesis. It cannot be made in Photoshop or Illustrator. You must write software.

It represents the present-day revolution within news organizations. Some call it data journalism. Or explorable explanations. Or interactive storytelling. Whatever the label, it’s a huge shift from ledes and infographics.

And what happened, according to him, is that

Creative coders turned their sights from media art to journalism. They’re writing software about ideas that have eluded traditional news organizations, either because they were too complex to explain in prose or they were trapped in a spreadsheet/academic paper.

From his examples, you might conclude that "data journalism" is another term for "animated .gifs".

But I'd argue that the "data" part is key, and that the shift to animations, cool as they are, is more like a change in typefaces and page design. And in fact the animations in the "Film Dialogue" piece are just superficial eye-candy transitions from one graph to another.

What's effective about that piece is the underlying data analysis, and its presentation in accessible graphs and tables — which are not in any sense "raw numbers without any abstractions". Crucially, the authors' analysis relates the "raw data" to relevant, interesting, and intelligible abstractions, summarized in clear graphics. This is exactly what John Tukey taught us to do more than half a century ago, under the name of "Exploratory Data Analysis".
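A Tukey-style summary of per-film dialogue shares illustrates the point: a few quantiles and a count say more than the raw numbers ever could. The figures below are invented for illustration, not taken from the Polygraph data:

```python
import statistics

# Hypothetical female-dialogue shares for ten films (fraction of all lines).
female_share = [0.12, 0.22, 0.31, 0.35, 0.08, 0.27, 0.45, 0.18, 0.52, 0.25]

# Tukey-style five-number thinking: quartiles as abstractions over raw counts.
q1, q2, q3 = statistics.quantiles(female_share, n=4)
print(f"median={q2:.2f}, IQR=[{q1:.2f}, {q3:.2f}]")
print("films with majority-female dialogue:",
      sum(s > 0.5 for s in female_share), "of", len(female_share))
```

The median and quartiles are abstractions in exactly the sense at issue: they compress thousands of line counts into quantities a reader can compare across genres, eras, or budgets.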

And the relationship of this kind of data analysis and graphical exploration to "writing software" is worth thinking about. There's nothing that Anderson and Daniels did that couldn't in principle have been done fifty years ago. But they would have had to spend months or years in library archives counting lines in screenplays by hand, and more months or years exploring different ways of summarizing the results, again by hand. And of course that never would have happened.

What's different now is that we have cheap powerful computers, networked digital archives, and accessible free or cheap software systems. This makes it possible to do many kinds of large-scale data exploration in minutes, hours, or days, rather than months, years, or decades. And it makes it possible for ordinary people to present the results to a large audience, without owning printing presses and distribution networks.

All of that is a Good Thing, in my opinion. Though the future is not entirely rosy… "Data journalism" allows all the usual routes to bullshit: the file drawer effect, data dredging, model shopping, confirmation bias, software bugs, and outright fraud. And the same goes for the application of similar techniques in politics, advertising, and public relations more generally.
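One of those routes, data dredging, is easy to demonstrate with a toy simulation (a sketch of the general failure mode, not anyone's actual analysis): flip fair coins for a couple of hundred imaginary subgroups, then "discover" the ones that look skewed. Every discovery is, by construction, noise:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Pure-noise "dataset": 200 hypothetical subgroups, each a fair coin
# flipped 100 times, with no real effect anywhere. A dredger tests every
# subgroup and reports any that clears a nominal significance threshold.
n_groups, n_flips = 200, 100
# Under the null (fair coin), |heads - 50| >= 10 happens roughly 5-6% of
# the time, so 200 looks should yield about ten spurious "findings".
hits = sum(
    abs(sum(random.random() < 0.5 for _ in range(n_flips)) - 50) >= 10
    for _ in range(n_groups)
)
print(f"{hits} nominally significant subgroups out of {n_groups}, from pure noise")
```

The cure is the usual one: specify the comparisons in advance, or correct for the number of looks; and publishing the code, as Anderson and Daniels do, at least lets readers count the looks.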

Is there a solution? Certainly not the traditional academic review system, which mainly frustrates and delays publication, without doing a good job of preventing the cited sins (as the reproducibility crisis shows).  We can hope that increased access to research and publication opportunities, along with improved systems of social-network-aware linkage, will eventually provide better solutions. Though the conservatism of academic culture will add decades to the process, and the efforts of "amateurs" like Anderson and Daniels may lead the way…

Anyhow, Polygraph has plenty of other examples of interesting data journalism focused on popular culture.

[One suggestion: The authors did a fair amount of work to collect scripts and connect lines to roles to actors, and actors to variables like sex and age. If they can find a way to make the resulting dataset available to others, it would facilitate many other sorts of investigation. And other investigators would be able to add to the data, correct errors, and so on. UPDATE — they're ahead of me on this, with much of their data available on github.]

[Finally, let me register one small objection to the Film Dialogue piece. There's no bibliography or literature review, so the authors don't mention the earlier work on Disney Princess films by Carmen Fought and Karen Eisenhauer. UPDATE — as commenters immediately pointed out below, my textual search for Fought and Eisenhauer in the Anderson and Daniels piece missed the fact that the authors link to the WaPo article on the Fought & Eisenhauer LSA presentation.]

  1. Yuval said,

    April 10, 2016 @ 8:57 am

    Oh, but they do mention Fought and Eisenhauer – they link to the WP piece right below the first figure.

  2. Ben Zimmer said,

    April 10, 2016 @ 8:57 am

    No explicit citation to Fought and Eisenhauer, but they do link to Jeff Guo's Washington Post piece about their LSA paper, which generated lots of media attention.

  3. Rebecca said,

    April 10, 2016 @ 11:47 am

    Do they ever say how they chose the 2000 films out of the 8000 scripts at their disposal?

    [(myl) I don't think they say. I wondered about this myself — it's possible that they ran out of time in assigning actors to parts to lines; or 6000 scripts were duplicates, or corrupted, or partial, or something…]

  4. Linda Seebach said,

    April 10, 2016 @ 1:31 pm

    Neuroskeptic has a recent post, noting a new site that "reveals the enormous variety of different ways which psychologists have devised to analyse the data from the same experimental task." In no more than 120 papers, there are 147 different ways to present results from the competitive reaction time task.

    A new way to fudge!

  5. Rubrick said,

    April 10, 2016 @ 4:40 pm

    I see a possibly key barrier to this transition: Do schools of journalism routinely offer (require) courses in statistics and programming?

  6. Nancy Friedman said,

    April 10, 2016 @ 7:40 pm

    @Rubrick: I can't speak for other J-schools, but I know my alma mater, UC Berkeley, offers classes in "coding interactives," data visualization, and visual storytelling.

  7. andyb said,

    April 16, 2016 @ 4:26 pm

    You ask, "Is there a solution?"

    I think putting both the data and the software up on GitHub, as they've done, is a large step toward the solution. This is basically the same as the argument for open source software (as in "more eyes means more quality", as opposed to the argument for free software, "people have the inherent right to see and modify the code they depend on").

    People who don't like their conclusions can download everything and look for specific problems, instead of just guessing, or insinuating the kinds of things they might have done wrong. And when they publish their counter-argument, they'll include their modified software searching the same data, or a filtered version of the data, so others can compare the assumptions each side made.

    Of course open source isn't a magic bullet for software (it's the worst way of doing software, except for all the other ways that have been tried), so it won't be for journalism and research either. But as a starting point, I think it'll help more than peer review, or whoever has the shoutiest pundits winning, or anything else we've been doing.
