The immortal Pierre Vinken
On November 7, publishers Reed Elsevier announced the passing of Pierre Vinken, former Reed Elsevier CEO and Chairman, at age 83. But to those of us in natural language processing, Mr. Vinken is 61 years old, now and forever.
Though I expect it was unknown to him, Mr. Vinken has been the most familiar of names in natural language processing circles for years, because he is the subject (in both senses, not to mention the inaugural bigram) of the very first sentence of the Wall Street Journal (WSJ) corpus:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
But there's a fascinating little twist that most NLPers are probably not aware of. I certainly wasn't.
It is hard to imagine the long nights spent, the whiteboards filled, the ink spilled by NLP researchers who, when they first started working on whatever they were working on, saw that sentence first of all. Actually, more likely, they saw this:
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NML (CD 61) (NNS years)) (JJ old)) (, ,)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29)))) (. .)))
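For anyone who hasn't stared at this format before, the bracketing is a nested labeled tree whose leaves, read left to right, spell out the sentence. Here is a minimal sketch in plain Python of how one might recover the words and part-of-speech tags from it. The helper names (`tokenize`, `parse`, `leaves`, `tagged`) are my own for illustration and don't come from any Treebank tooling; in practice a library such as NLTK would be the usual choice.

```python
# Illustrative sketch only: a tiny reader for Penn Treebank-style bracketings.
# All function names here are hypothetical helpers, not part of any library.

BRACKETED = (
    "( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) "
    "(ADJP (NML (CD 61) (NNS years)) (JJ old)) (, ,)) "
    "(VP (MD will) (VP (VB join) (NP (DT the) (NN board)) "
    "(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) "
    "(NP-TMP (NNP Nov.) (CD 29)))) (. .)))"
)


def tokenize(text):
    """Split the bracketing into parentheses and bare symbols."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse one bracketed node starting at tokens[pos].

    Returns ((label, children), next_pos). The outermost bracket in
    Treebank files has no label, so an empty string is used for it.
    """
    assert tokens[pos] == "("
    pos += 1
    label = ""
    if tokens[pos] != "(":
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])  # a leaf word
            pos += 1
    return (label, children), pos + 1


def leaves(node):
    """Collect the words of the sentence in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [word for child in children for word in leaves(child)]


def tagged(node):
    """Collect (word, tag) pairs from the preterminal nodes."""
    if isinstance(node, str):
        return []
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    return [pair for child in children for pair in tagged(child)]


tree, _ = parse(tokenize(BRACKETED))
words = leaves(tree)
pairs = tagged(tree)
print(" ".join(words))
print(pairs[:2])
```

Joining the leaves gives back the tokenized sentence, with the punctuation split off as separate tokens, which is how the corpus stores it.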
If you search for the sentence on Google, you get back somewhere around 6000 hits. The results are a delightful smorgasbord of work in NLP, from directions on using tools like the OpenNLP tokenizer and NLTK, to academic papers and presentations, to online discussions, to data used in homework assignments.
It was therefore really neat to discover this, in a history of Elsevier publishing:
… Thus began a decade of mergers for Elsevier. The first came in 1970 when Elsevier and North Holland merged, a natural partnering of two friendly and complementary businesses. Next, in 1971, came a merger with Excerpta Medica an old and innovative medical abstracts company. Excerpta Medica, an international medical publishing company in business since the 1940s, was currently in the process of creating an electronic database that would contain abstracts of all medical literature published in all languages. That database, born of neurosurgeon-turned-publisher Pierre Vinken's personal notes, was ultimately to become EMBASE, a product which Elsevier launched in 1972 in its first venture into the field of information technology.
Yup. Pierre Vinken, who passed at 83, but is immortal at 61 years old, was not just content in a corpus, and not just a neurosurgeon-turned-publisher. He was an early corpus builder, too. How about that.
[Hat tip to Adam Lopez for passing on the news of Vinken's passing. Rebecca Hwa's comment in response, "He will always remain 61 years old in my heart", inspired this post. And, as Noah Smith adds, "'His name will always be connected with that of Reed Elsevier.' … in every statistical parser ever trained for English, until the end of time." Appropriate for a guy born in a place called Treebeek… (additional hat tip to David Bamman for that one).]
Noah Smith said,
December 2, 2011 @ 11:39 am
Technically, Vinken's name would be associated with "Elsevier N.V.," per the second sentence: "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group." (And only in fairly high-order, lexicalized parsing models.) We NLPers just can't seem to get away from mergers and acquisitions, can we?
Further, because these sentences are found in Section 00, they aren't actually used in training. So, despite his influence on the field, Mr. Vinken's name must remain Out Of Vocabulary in parsers trained according to the longstanding train/dev/test split.