Wikipedia article length


For various reasons I recently downloaded snapshots of Wikipedia in various languages, and I'd like to share with you some discoveries, starting with article length in the English Wikipedia.

As of the May 3 version, which comprises 5,110,263 articles, the longest article (at 50,122 words) is "California Proposition 218 (1996)". The top ten in terms of length in words are:

1 California Proposition 218 (1996)
2 List of Warriors characters
3 List of Dutch inventions and discoveries
4 South African labour law
5 Secret Wars (2015 comic book)
6 Constitution of Myanmar
7 Characters of Supernatural
8 The Dresden Files characters
9 List of Redwall characters
10 Characters of the Mass Effect universe

Numbers 15 through 18 are:

15 British literature
16 History of Austria
17 History of Cumbria
18 History of Eglin Air Force Base

The top individuals on the list are:

19 Hubert Gough
21 John French, 1st Earl of Ypres
30 Sir Henry Wilson, 1st Baronet

In between John French and Sir Henry Wilson, we find, among others:

24 History of Oregon State Beavers football
25 History of Western civilization
26 List of Friday the 13th characters

Overall, Wikipedia article length correlates with cultural importance, I guess. But there are some quirks.

Update — In case you want to do something similar, here's the script I used to get the snapshot:

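#!/bin/sh
# Which wiki to fetch.
WHICH=enwiki
# Download the latest articles dump.
wget http://download.wikimedia.org/"$WHICH"/latest/"$WHICH"-latest-pages-articles.xml.bz2

# Extract plain text into bzip2-compressed chunks of about 250K each.
mkdir -p extracted
bzip2 -dc "$WHICH"-latest-pages-articles.xml.bz2 |
       WikiExtractor.py -cb 250K -o extracted
# Concatenate the decompressed chunks into a single file.
find extracted -name '*.bz2' -exec bunzip2 --stdout '{}' \; > "$WHICH".xml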

Obviously you could substitute other wikinames for "enwiki" in order to get other languages. And perhaps someone will tell me that I should have used different flags for WikiExtractor.py in order to get the full version of articles like "1918 New Year Honours"…
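
For the counting step, here is a minimal sketch rather than the exact code behind the numbers above. It assumes that the concatenated file keeps the <doc id=... url=... title=...> and </doc> wrappers that WikiExtractor.py puts around each article, and it treats a word as any whitespace-delimited token. The function name count_words is made up for illustration.

```shell
#!/bin/sh
# Sketch: per-article word counts from WikiExtractor-style output.
# Assumption: every article is wrapped as
#   <doc id="..." url="..." title="Some Title">
#   ...text lines...
#   </doc>
# count_words is a hypothetical helper, not part of WikiExtractor.
count_words() {
    awk '
        /^<doc / {
            # Pull the title out of the opening tag and reset the counter.
            title = $0
            sub(/^.*title="/, "", title)
            sub(/".*$/, "", title)
            words = 0
            next
        }
        /^<\/doc>/ {
            # End of article: print "count<TAB>title".
            printf "%d\t%s\n", words, title
            next
        }
        # Any other line is article text; NF is its number of
        # whitespace-separated tokens. If the extractor repeats the
        # title as the first text line, those words are counted too.
        { words += NF }
    ' "$@"
}
```

Something like `count_words enwiki.xml | sort -rn | head -n 10` would then list the ten longest articles. A different definition of "word", or different extraction settings, will give somewhat different counts and rankings.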

 



27 Comments

  1. DWalker said,

    May 18, 2016 @ 5:20 pm

    The prop 218 article is pretty interesting, thanks for that pointer! I remember Prop 13.

  2. DWalker said,

    May 18, 2016 @ 5:21 pm

    By the way, it's interesting that the extremely long Prop 218 article has NOTHING on the "Talk" page. Several short and otherwise-un-noteworthy pages have quite long Talk pages.

  3. Guy said,

    May 18, 2016 @ 5:52 pm

    Maybe I'm just pointing out the obvious, but 24 and 25 is pretty hilarious.

  4. Anschel Schaffer-Cohen said,

    May 18, 2016 @ 5:53 pm

    Conjecture: unusually long Wikipedia articles have the characteristic that they haven't attracted a devoted editor who splits them up when they get too big, and thus are not the most culturally important ones. For example, the article titled "Barack Obama" is not so long, because there are separate articles about e.g. "Family of Barack Obama" and "Economic Policy of Barack Obama".

  5. Guy said,

    May 18, 2016 @ 5:54 pm

    (And yes, I realize that 25 is only an overall summary of a topic that will be covered in detail in countless other articles, but still.)

  6. AG said,

    May 18, 2016 @ 5:57 pm

    I think the articles that are long lists of characters are actually indicators that those fictional universes are either a) of *medium* interest, b) of high interest but only to a very small fanbase, or c) of high interest but to a non-tech-savvy or non-wiki-tropic fanbase. Anything more popular (Star Wars, Pokemon, etc.) probably already has a vast number of separate articles on each character, or have its own much larger wikis.

  7. AG said,

    May 18, 2016 @ 6:01 pm

    *has

  8. Ken S said,

    May 18, 2016 @ 6:35 pm

    This seems a pretty odd list of things to me. I don't know about these, but keep in mind things can be very skewed by one person or one bot – see https://meta.wikimedia.org/wiki/List_of_Wikipedias and notice that Cebuano and Waray-Waray are in the top ten languages (due to some guy writing a bot to add articles)

    Is this on parsed plain text or with the wiki-markup straight from the xml dumps? Last time I looked (although years ago now), it wasn't that easy to find something to give you clean text with the markup stripped out.

  9. Ethan said,

    May 18, 2016 @ 6:41 pm

    @AG: The fanbase for Redwall consists mostly of young readers, generally young enough that I would not expect them to be re-organizing Wiki pages. I gather the same is true for Warriors, although that series came along after my kids-starting-to-read-chapter-books time of life so I don't know the target age exactly.

  10. Chris C. said,

    May 18, 2016 @ 6:50 pm

    I'd say that Wikipedia article length correlates more directly with a ratio of determination of the wonks who believe the subject is important, divided by the number of wonks who share that interest.

    Someone is really interested in California Prop 218. But there can't be very many like him, because otherwise Wikipedia's normal community preference for shorter articles would have come into play, and the article would be split up into a series of shorter ones, which is what usually happens when there are many niggling details and ramifications. You get a main article, which merely summarizes each subtopic and links to a more complete sub-article on each. (I looked at the history after writing that paragraph and indeed found the bulk of the text to have been contributed by exactly two editors. One is an anonymous editor posting from an IP address owned by UCLA.)

  11. Frédéric Grosshans said,

    May 19, 2016 @ 2:51 am

    @Ken S: wp2txt strips the Wikipedia markup away and works pretty well.

  12. eyesay said,

    May 19, 2016 @ 3:37 am

    This is completely wrong. Wikipedia maintains its own Long pages list, which lists the longest pages. The claimed longest page, ‎California Proposition 218 (1996), is #39. The claimed 2nd longest page, List of Warriors characters, is #24. The claimed 3rd longest page, List of Dutch inventions and discoveries, is #9. The claimed 4th longest page, South African labour law, is #98. The claimed 5th longest page, Secret Wars (2015 comic book), is #220. 14 of the 30 longest articles are "List of" articles.

    Wikipedia's list is sorted by number of bytes. Language Log's list is sorted by number of words. These two should be pretty much in the same order. Word count can be measured in more than one way. For instance, do you count the words in the table of contents? But to make things simple, I copied the entire page, including Wikipedia's own navigation, and pasted into WordPerfect. For the claimed longest page, California Proposition 218 (1996), claimed 50,122 words, WordPerfect counts 60,142 words (an over-estimate because it's counting Wikipedia navigation and other extra stuff). For Wikipedia's actual longest page, 1918 New Year Honours, the WordPerfect count is 139,671, more than twice the count for California Proposition 218 (1996). Language Log, you completely messed this up.

    [(myl) Hmm. I used WikiExtractor.py to get text from the Wikipedia database dump, and then I counted words in the resulting version of the articles. That code generally works well, but in the case of articles like "1918 New Year Honours", which is mostly just a very long list of nested lists, it seems to omit most of the material below the first level of the lists, yielding a version of the article with only 437 words. I'm not sure whether this is a bug or a feature — since I was using it to get text samples, it's probably a feature in this case, though it obviously raises issues in ranking articles by length. (My version of WikiExtractor.py is also not the latest one. But as I said, it generally seems to do the right thing in terms of extracting text from the dumps.)

    And in fairness to WikiExtractor.py, the articles like "1918 New Year Honours" are mainly just a long list of names. The list-of-characters articles like "List of Warriors characters", despite the name, include long textual descriptions of individual characters.]

  13. cs said,

    May 19, 2016 @ 7:32 am

    I like that the list of Dutch Inventions has a flag at the top saying "this article may be too long" and then a note saying "This list is incomplete; you can help by expanding it".

  14. D.O. said,

    May 19, 2016 @ 10:59 am

    Seems like some people dump their dissertations into Wiki articles. Doesn't seem a reasonably effective way to communicate about your subject.

  15. Jerry Friedman said,

    May 19, 2016 @ 11:13 am

    D.O.: The phenomenon I'm more familiar with is people dumping their less-than-graduate-level obsessions into Wikipedia articles. In my case, of course, the obsessions are harmless and mild.

  16. Michael said,

    May 19, 2016 @ 11:19 am

    By either count, the California Prop 218 one is now long enough to pass as a NaNoWriMo novel. Useful for people who don't believe in entering their novels into the counting system to know. They can just copy and paste from Wikipedia!

  17. Denis Moskowitz said,

    May 19, 2016 @ 12:11 pm

    Now I'm imagining a novelist who is as obsessed with taxation in California in the 1990s as Melville was about whaling…

    [(myl) Maybe the next big thing from Thomas Pynchon? Or his spiritual heir?]

  18. pago said,

    May 19, 2016 @ 1:05 pm

    My personal favorite bizarrely long and detailed Wikipedia article is the one on toilet paper orientation. I'm not sure which of the above theories explains that one, though.

  19. maidhc said,

    May 19, 2016 @ 4:00 pm

    If the byte count order differs from the word count order, doesn't it indicate that authors on certain topics tend to use longer words?

  20. Chris C. said,

    May 19, 2016 @ 5:21 pm

    @pago — That article isn't especially long, just longer than you'd expect for the subject it covers. In that it's exceptionally characteristic of how Wikipedia works: The amount of detail on any subject is directly related to how interested editors are in it, not in how important that subject is more broadly.

    Those of us of a certain age might remember the controversy raging among correspondents to the Ann Landers advice column back in the day, which is mentioned in the article lead. At the end of the day this is an utterly trivial matter, but people feel very strongly about it. Any such subject is likely to be covered in Wikipedia by an article that develops into something bizarrely long.

  21. ohwilleke said,

    May 19, 2016 @ 11:57 pm

    One of the things that the sample list illustrates is that Wikipedia is extraordinarily good at providing reference material styled coverage of popular culture that isn't remotely rivaled in any other medium.

    None of the articles in question involve music genres, but Wikipedia also does a stellar job of cataloguing, analyzing, name dropping and tracing lines of cultural influence for every tiny music sub-genre that has existed in the 20th century. Even the most expert music reviewer for a hip world class city's popular culture magazine would be hard pressed to scratch the surface of this data set.

  22. January First-of-May said,

    May 20, 2016 @ 7:23 am

    @ohwilleke – they're awful at online media though (in particular, they have almost nothing on webcomics).

    TV Tropes, though admittedly with a completely different structure, is a lot more comprehensive in that category.

  23. John Finkbiner said,

    May 20, 2016 @ 10:06 am

    @January First-of-May,

    There was an enormous purge of articles about web comics, ostensibly because they were not "significant" enough for a group of editors (or one editor? I forget.) I don't know the current state of play, but person(s) doing the deleting had enough pull to prevent the articles from being restored unless there was evidence of an enormous reader base.

  24. Madeleine Ball said,

    May 22, 2016 @ 8:11 am

    You might find it interesting to compare this to traffic statistics: http://stats.grok.se/ and https://dumps.wikimedia.org/other/pagecounts-raw/

  25. pj said,

    May 22, 2016 @ 10:00 am

    For a glimpse of a much simpler and less overwhelming world of knowledge, the Scots version of Wikipedia is a pleasure.

    For instance, 'Tree' in English: novella-length article full of science, stats and technical vocabulary. 'Tree' in Scots: soothing.
    Likely to enhance the non-speaker's knowledge of Scots grammar and orthography, though, if not necessarily their knowledge of trees.

  26. pj said,

    May 22, 2016 @ 10:01 am

    Oh, sorry, something went awry with that second link: here.

  27. codetaku said,

    May 25, 2016 @ 9:59 am

    Strongly agree with AG. The "List of Pokemon" (where each pokemon comes with at least a brief description, and often much more detail) is broken up into over a dozen different articles, including several articles dedicated to singular pokemon of cultural significance. When combined it would be many times longer than the current longest wikipedia article.
