## British headlines: 18% ~~less informative~~ shorter

Chris Hanretty, "British headlines: 18% less informative than their American cousins", 11/29/2013:

I’m currently working on a project looking at the representation of constituency opinion in Parliament. One of our objectives involves examining the distribution of parliamentary attention — whether MPs from constituencies very concerned by immigration talk more about immigration than MPs from constituencies that are more relaxed about the issue.

To do that, I’ve been relying on the excellent datasets made available from the UK Policy Agendas Project. In particular, I’ve been exploring the possibility of using their hand-coded data to engage in automated coding of parliamentary questions.

One of their data-sets features headlines from the Times. Coincidentally, one of the easier-to-use packages in automated coding of texts (RTextTools) features a data-set with headlines from the New York Times. Both data-sets use similar topic codes, although the UK team has dropped a couple of codes.

How well does automated topic coding work on these two sets of newspaper headlines?

With the New York Times data (3104 headlines over ten years, divided into a 2600 headline training set, and a 400 headline test set), automated topic coding works well. 56.8% of the 400 test headlines put in to the classifier were classified correctly. That’s pretty amazing considering the large number of categories (27) and the limited training data.

How do things fare when we turn to the (London) Times (6571 headlines over ten years, divided into a 5700 headline training set and an 871 headline test set)? Unfortunately, despite having much more in the way of training data, only 46.6% of articles were classified correctly.

Looks like those puns are [not] just bad for SEO, they’re also bad for the text-as-data movement…

We've noted from time to time that British headlinese is hard for Americans to understand:

"Fish foot spa virus bombshell", 9/10/2012
"Coin change 'skin problem fear' hed noun pile puzzle", 4/21/2012
"Lightning strike crash blossom", 10/27/2011
"Eight word BBC headline noun pile construction", 5/31/2011
"BBC Brit head noun pile win", 5/18/2011
"Headline noun pile length contest entry", 4/18/2010
"Brit noun pile heds quizzed", 3/5/2009

But you'd think that text classifiers, which typically treat texts as unordered bags of words, might actually prefer those noun-heavy Brit word pile puzzles.
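To make "unordered bags of words" concrete, here is a minimal sketch in base R. The helper name `bag_of_words` is hypothetical, and the tokenisation (lower-casing, then splitting on anything that is not a letter or apostrophe) is an assumption for illustration, not necessarily what RTextTools does internally.

```r
# Reduce a headline to an unordered bag of lower-cased words with counts.
# bag_of_words is a hypothetical helper, not part of RTextTools.
bag_of_words <- function(headline) {
  words <- strsplit(tolower(headline), "[^a-z']+")[[1]]
  table(words[words != ""])  # table() sorts by word, so order is discarded
}

original  <- bag_of_words("Fish foot spa virus bombshell")
scrambled <- bag_of_words("Bombshell virus spa foot fish")

# Same words, same counts: a bag-of-words classifier sees the same input.
same_bag <- identical(names(original), names(scrambled)) &&
  all(original == scrambled)
print(same_bag)
```

Under this representation, scrambling the noun pile changes nothing, so word order alone cannot help or hurt such a classifier.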

Update — a quick examination of the underlying data suggests that neither puns nor syntax are involved in the explanation at all. Chris Hanretty found that the performance of a simple "text as data" classifier on Times of London headlines was 46.6%, whereas the same classifier trained and tested on New York Times headlines was 56.8%, for a ratio of 46.6/56.8 = 0.8204.

But if I fetch the NYT training and testing data and the Times of London training and testing data that Chris used, I find the following counts of headlines, words, and characters:

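| Source | Headlines | Words | Characters | Words per headline (W/H) | Characters per headline (C/H) |
|---|---|---|---|---|---|
| New York | 3278 | 25381 | 154756 | 7.742831 | 47.21049 |
| London | 21854 | 138481 | 846944 | 6.336643 | 38.75464 |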

38.75464/47.21049 = 0.8208904

which is remarkably close to the ratio of classifier performances

46.6/56.8 = 0.8204225

… within 0.05% (five parts in ten thousand), in fact, suggesting that the average information density per headline character is very close to constant.

So really, the title of this post should be: "British headlines: 18% shorter".
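For anyone who wants to check the arithmetic, here is a minimal sketch in base R. It uses only the counts and accuracy figures quoted above; the variable names are illustrative.

```r
# Headline, word and character counts as quoted in the post.
counts <- data.frame(
  source     = c("New York", "London"),
  headlines  = c(3278, 21854),
  words      = c(25381, 138481),
  characters = c(154756, 846944)
)

counts$words_per_headline <- counts$words / counts$headlines
counts$chars_per_headline <- counts$characters / counts$headlines

# London / New York ratio of mean headline length in characters.
length_ratio <- counts$chars_per_headline[2] / counts$chars_per_headline[1]

# London / New York ratio of reported classification accuracy.
accuracy_ratio <- 46.6 / 56.8

# Relative difference between the two ratios.
relative_gap <- abs(length_ratio - accuracy_ratio) / accuracy_ratio

print(counts)
print(c(length_ratio = length_ratio,
        accuracy_ratio = accuracy_ratio,
        relative_gap = relative_gap))
```

The two ratios agree to about three decimal places, which is the coincidence the update points out. On its own, that agreement does not show that length is the cause.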

1. ### Victor Mair said,

November 29, 2013 @ 8:29 am

I'm wondering if British headlines are sometimes intended to be teasers, in which case those who compose them might not want to reveal too much or be entirely transparent.

2. ### Ginger Yellow said,

November 29, 2013 @ 8:36 am

> I'm wondering if British headlines are sometimes intended to be teasers,

Very much so. The approach is to entice the reader to actually delve into the article. If you give everything away in the headline, they don't need to read it. The trick is to give enough away to make sure that they're intrigued, but no more.

As for the results in question, I too was slightly surprised, but for different reasons. Of all British papers, the Times is probably the closest to the longwinded US style (or at least it used to be) and the furthest from the tabloid-driven noun pile pun fest.

3. ### Ellen K. said,

November 29, 2013 @ 9:16 am

Seems to me, if I'm understanding right, that this isn't at all accounting for headline accuracy. I've seen way too many (American) headlines that say something quite different from, often opposite to, what the actual article says.

4. ### Chris Waigl said,

November 29, 2013 @ 11:33 am

How well does the classifier do on Language Log post titles?

5. ### Mark Young said,

November 29, 2013 @ 11:36 am

Does the accuracy of the coding go up as the headlines lengthen? If we split the two test sets at their median lengths would the four ratios of average length of the test set members to classifier performance on that test set be approximately equal?

[(myl) The data and the classifier are all freely available on line, so this is one of many questions that you can easily answer yourself.]

6. ### Chris Hanretty said,

November 29, 2013 @ 12:40 pm

Thanks for the mention, and for re-running the analysis — I ran ahead of the data in thinking of the puns, and never thought to check headline length.

7. ### Mark Young said,

November 29, 2013 @ 1:23 pm

Yeah, now I know how my students feel when I refuse to do their homework for them! Ah, well, I've been thinking for a while I should learn how to use R.

Anyway, I've downloaded and installed R and RTextTools, and the data files you linked to. I've managed to load the files into R as data frames with 0 (!) columns.

Unfortunately I don't know where to get the classifier — or what code I'm supposed to use to run it if it's already part of RTextTools. It doesn't seem to be linked in your message or Chris's, nor can I find it by following the links and searching in either message (tho' I didn't follow the links to ft.com — it didn't seem like it'd be there). And I've used up all the time I can spare right now, so I guess I won't be able to report an answer any time soon.

I'll check back later to see if anyone's given any helpful hints….

[(myl) I think that collingwood_rtexttools_unc.pdf and lecture2_nytimes.r from here should be a good start…]

8. ### dw said,

November 29, 2013 @ 4:01 pm

Brit headlines in "Yank study classification shocker"

9. ### Chris Hanretty said,

November 29, 2013 @ 5:47 pm

Mark Young — the code I used is at https://gist.github.com/chrishanretty/7713010 — it's essentially the same as the file Mark Liberman linked to, except using London Times data instead of NY Times data. You'll want to download the Excel file from http://policyagendasuk.wordpress.com/datasets/ if you're following exactly what I did

10. ### Mark Young said,

November 29, 2013 @ 11:05 pm

OK, I've rerun the analyses using only the longer and shorter headlines in the London Times data set. I added the following lines to your code, Chris (just before the matrix is created):

    ### Select only long (or short) rows
    midLength = median(nchar(media$Title))
    media <- subset(media,nchar(Title) > midLength)
    numRows <- nrow(media)
    numTrain <- as.integer(11 * numRows / 13)
    testStart <- numTrain+1

(The > was changed to < for the second run. midLength was 37. nrow(media) became 3169 after the final subsetting for long titles, and 3259 after subsetting for short ones. I messed up the math a bit when I was breaking it into training/test sets (it should have been 13/15 to get the same ratio of sizes), but I don't think that's too much of a problem….)

The code for creating the container was changed accordingly:

    corpus <- create_container(media_matrix, media$Major_Topic, trainSize = 1:numTrain, testSize = testStart:numRows, virgin = FALSE)

The ensemble_summary for long titles:
           n-ENSEMBLE COVERAGE  n-ENSEMBLE RECALL
    n >= 1                1.00               0.49
    n >= 2                0.66               0.63

The ensemble_summary for short ones:
           n-ENSEMBLE COVERAGE  n-ENSEMBLE RECALL
    n >= 1                1.00               0.45
    n >= 2                0.73               0.53

Now I'm not sure if I'm interpreting these results properly, but it looks to me like the longer titles don't give a much better result than the shorter ones. The mean number of characters for the longer titles was 51.6, so the longer titles from the Times were a bit longer on average than the titles from the NYT (47.2), and the sizes of the training and test sets were similar (2681 vs 2600 and 488 vs 400). Yet the classifier performs nearly as badly as the classifier on the whole sample — and about 13% worse than the NYT classifier.

Doesn't that suggest that it's not just the lengths of the headlines that cause the difference? (Honest question! That's what it looks like to me, but I am a statistics novice.)
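The training and test sizes mentioned in this comment can be reproduced with a few lines of base R. This is only a sketch: `split_sizes` is a hypothetical helper, and it assumes the split truncates with `as.integer`, as in the snippet posted in the comment.

```r
# Training/test sizes for a given number of rows and training fraction.
# split_sizes is a hypothetical helper; truncation mirrors as.integer() above.
split_sizes <- function(n_rows, train_fraction) {
  n_train <- as.integer(train_fraction * n_rows)
  c(train = n_train, test = n_rows - n_train)
}

long_11_13 <- split_sizes(3169, 11 / 13)  # the split actually used for long titles
long_13_15 <- split_sizes(3169, 13 / 15)  # the split that was intended

# Training fraction of the NYT example (2600 training, 400 test headlines).
nyt_fraction <- 2600 / (2600 + 400)

print(long_11_13)
print(long_13_15)
print(nyt_fraction)
```

The 11/13 split gives the 2681 training and 488 test headlines quoted in the comment, and the NYT fraction works out to 13/15, which is why that was the intended ratio.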

11. ### Mark Young said,

November 29, 2013 @ 11:08 pm

PS: changing to 13/15 gives slightly worse results for the long headlines —
           n-ENSEMBLE COVERAGE  n-ENSEMBLE RECALL
    n >= 1                1.00               0.48
    n >= 2                0.71               0.58

I'm going to bed, now….

[(myl) The usual summary measure for classification accuracy is "F1", which is the harmonic mean of precision and recall. The reason for using some appropriate combination of precision and recall is that those two measures can usually be traded off against one another. This case is also a bit unusual in that there are many classes and some of them are not very well represented in the training and testing sets — so looking at the whole distribution of results across classes, and also trying cross-validation, might be helpful.

Anyhow, I believe that F1 is reported in the analytics variable — what did you get for F1 by class in the whole dataset and in the shorter and longer halves?]
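For reference, the standard per-class definitions can be written as follows. The symbols are introduced here for illustration and are not taken from RTextTools: for a class $c$, $TP_c$, $FP_c$ and $FN_c$ are the counts of true positives, false positives and false negatives.

$$
P_c = \frac{TP_c}{TP_c + FP_c}, \qquad
R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F_{1,c} = \frac{2\,P_c\,R_c}{P_c + R_c}
$$

Written this way, $F_{1,c}$ is undefined whenever $P_c + R_c = 0$ or either ratio has a zero denominator, which can happen for classes that are rare in the test set.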

12. ### Mark F. said,

November 30, 2013 @ 10:57 am

This isn't really about US vs UK headlines, it's about NYT vs The Times. I don't know how headlines in the Times (of London) compare to, say, those of the Guardian, but I do have the sense that NYT headlines are quite different from the typical American headline and, in particular, longer. Statistics may prove me wrong; I notice that the main headline (as opposed to the subhed) for Obama's 2008 election was "OBAMA". But, for the start of the Iraq war, it was "BUSH ORDERS START OF WAR ON IRAQ; MISSILES APPARENTLY MISS HUSSEIN."

13. ### D.O. said,

November 30, 2013 @ 11:11 am

@Mark Young. The program you are running, does it tell you which headlines were classified correctly and which incorrectly? If yes, you can look at distribution of correct classifications (on a test set, I guess, is best) vs. length. Maybe it should be distribution of F1, given that that's what people use in the trade. That would be more informative (I guess! I am not a statistician/machine learner). I would do it myself, but I don't speak R and have no desire to learn.
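One way to look at correct classifications against headline length is sketched below in base R. The data frame is a made-up stand-in: its titles, topic codes and column names are invented for illustration and are not taken from the Times data or from RTextTools output.

```r
# Hypothetical per-headline results: title, hand-assigned code, predicted code.
# All values are invented for illustration only.
results <- data.frame(
  title = c("Budget vote delayed",
            "Rail strike talks collapse after union walks out",
            "Flood fears",
            "Minister resigns over hospital funding dispute",
            "Cup final row",
            "Schools face new inspection regime from autumn"),
  manual    = c(1, 5, 7, 3, 29, 6),
  predicted = c(1, 5, 21, 12, 29, 6),
  stringsAsFactors = FALSE
)

results$length  <- nchar(results$title)
results$correct <- results$manual == results$predicted

# Split at the median length and compute the share classified correctly.
results$half <- ifelse(results$length > median(results$length), "long", "short")
accuracy_by_half <- tapply(results$correct, results$half, mean)

print(accuracy_by_half)
```

With real classifier output, finer length bins (for example via `cut`) would show the shape of the relationship rather than just two halves.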

14. ### Mark Young said,

November 30, 2013 @ 7:47 pm

I am so out of my depth, now.

There is nothing called F1 in the analytics object (at least, not in the part of it that prints out). I did create an object using create_precisionRecallSummary, and it includes something called _FSCORE for both SVM and MAXENTROPY for each label. These range from 0.17 to 0.72 for SVM (with 3 NaNs) and from 0.20 to 0.74 for MAXENTROPY (with 2 NaNs) on the full sample. But there's no "overall summary" there, which is what I was expecting.

The analytics object does include the labels assigned for each of the test headlines, with columns named (among others) MANUAL_CODE, CONSENSUS_CODE, PROBABILITY_CODE, CONSENSUS_INCORRECT and PROBABILITY_INCORRECT. From those I should be able to calculate true/false positives/negatives, and from those an overall precision, recall and F1, I think! But I don't want to do it unless I know that I'm going in the right direction! (And I don't think they're paying you to teach me this, tho' I have appreciated the help so far and the opportunity to start playing with R.)
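A minimal sketch of that calculation in base R follows. It assumes a data frame with the MANUAL_CODE and CONSENSUS_CODE columns named in the comment; the values below are invented, and `class_scores` is a hypothetical helper rather than an RTextTools function.

```r
# Toy stand-in for the per-headline table; the codes are invented.
doc_summary <- data.frame(
  MANUAL_CODE    = c(1, 1, 2, 2, 2, 3),
  CONSENSUS_CODE = c(1, 2, 2, 2, 3, 3)
)

# Per-class precision, recall and F1 from true and predicted labels.
class_scores <- function(truth, predicted) {
  classes <- sort(unique(c(truth, predicted)))
  rows <- lapply(classes, function(k) {
    tp <- sum(predicted == k & truth == k)
    fp <- sum(predicted == k & truth != k)
    fn <- sum(predicted != k & truth == k)
    precision <- tp / (tp + fp)  # NaN if the class is never predicted
    recall    <- tp / (tp + fn)  # NaN if the class never occurs in truth
    f1 <- 2 * precision * recall / (precision + recall)
    data.frame(class = k, precision = precision, recall = recall, f1 = f1)
  })
  do.call(rbind, rows)
}

scores   <- class_scores(doc_summary$MANUAL_CODE, doc_summary$CONSENSUS_CODE)
accuracy <- mean(doc_summary$MANUAL_CODE == doc_summary$CONSENSUS_CODE)
macro_f1 <- mean(scores$f1, na.rm = TRUE)  # unweighted mean over classes

print(scores)
print(c(accuracy = accuracy, macro_f1 = macro_f1))
```

The overall figure here is a macro average, an unweighted mean of the per-class F1 scores that skips undefined ones. Other summaries, such as weighting classes by frequency, are equally defensible, and which one suits this dataset is a judgment call.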

In any case, my original point was about the remarkable coincidence between the ratios of headline lengths and "classifier performances", and for that you were just using the percent correctly classified. It seemed to me that the conclusion you reached (18% fewer headlines were correctly classified because the headlines were 18% shorter on average) was probably wrong, and I *think* running the test with only the longer headlines from the LT sample showed I was correct, so long as that number I picked out from my analysis was, in fact, the proportion of headlines correctly classified (I'm not certain of that).

At this point I think I'll leave it to Chris (if he's interested) to do any further analyses. I've posted the code I used to reduce the data set to something very similar in headline lengths and number of data points to the NYTimes data set — and I learned a lot from doing it — but I'm not interested enuf to keep going.

So long, and thanks for the fish!

15. ### CrisisMaven said,

June 18, 2014 @ 10:55 am

Indeed, the British use strange headlines, have a strange cuisine and a weird sense of humor (maybe that all feeds into itself …). However, what the average US reader in the US hinterlands doesn't appreciate is that every European newspaper carries a lot of international news, and not only on two outside pages, while most (local) newspapers sold in the US, with the exception of the NYT or the Washington Post, have nothing but football, baseball and news about crop pests and prices "inside". The parochial world view of US citizens is proverbial in Europe. So maybe headlines need to be more specific there or else they wouldn't be understood.