Thanks, Bill Dunn!

In a comment on a recent LL post, Daniel C. Parmenter wrote:

In my MT days (starting in the early nineties) we used the WSJ corpus a lot. I read recently that the availability of this corpus was in no small part thanks to you. And so I thank you. In those pre-and-early Google/Altavista days the WSJ corpus was an enormous help. Thanks!

Daniel is referring to an archive of text from the Wall Street Journal, covering 1987-1989, originally published with some other raw material for corpus linguistics by the Data Collection Initiative of the Association for Computational Linguistics (ACL/DCI). And the person who most deserves thanks for the availability of the WSJ part of this publication (perhaps its most important part) is Bill Dunn, who was the head of Dow Jones Information Services in the late 1980s.

As far as I know, Bill's role in making this corpus available is not documented anywhere, so I'll take this opportunity to tell some of the story as I remember it. (The rest of this post is a slightly-edited version of an email that I sent on 5/1/2008 to someone at the WSJ who had corresponded with Geoff Pullum about an article on the use of corpus materials in linguistic research.)

In the mid 1980s, Bill Dunn came to visit AT&T to talk about networked digital media, part of what we now call the World Wide Web, though of course the web didn't exist then. Bill was the vice president for information services at Dow Jones & Company, and I was head of the linguistics research department at Bell Labs. Bill was convinced that in the future, people would get their information in digital form rather than on paper, via networked connections to information providers like Dow Jones. He felt that several sorts of technological innovation would be key to making that happen: changes in the network and in the devices that people connect to it with, but also in the way that information is stored, searched and presented.

Bill hoped that AT&T could help him with the network and the devices. I was most interested in the storage, searching and presentation, and I made the argument that the best way for him to foster progress in that area would be to make a body of WSJ text easily available in digital form to researchers around the world.

He agreed, and told some technicians at the DJIS site in Princeton to send me a few cartons of those old nine-track tapes. I read and decrypted them (they contained instructions for some antique typographical engine, as I recall, so this was not entirely trivial), and the contents were featured in a series of collections made available on CD-ROM via the Data Collection Initiative of the Association for Computational Linguistics, which was founded for the purpose. (And that's another story…)

Anyhow, I recently searched the web to find out what happened to Bill, whom I haven't spoken with in 20 years, and I found this (Juan Antonio Giner, "From Newspapers to 24-hour information engines", 10/2001):

In the early 1990s, digital language emerged as a new matrix to unify the traditional differentiation of the media. It was the start of a development that hitherto was impossible: electronification of the entire media — print or audiovisual.

From an era of co-existence we moved to a culture of cooperation. Although the transition from the analog world to a digital one called for strategies that were still passive, these new companies — such as Japan's Nikkei Group and Brasil's Agência Estado, which were pioneers of this convergence — became "post-newspaper" organisations.

Not everyone agreed on what lines to follow. William Dunn, vice president for electronic information services at Dow Jones & Company, was convinced that the print edition of The Wall Street Journal would soon be only one of many sources of revenue. His directors were skeptical, and Dunn left the company. Dow Jones later failed with its Telerate venture and lost its leadership position in world real-time financial services. Bloomberg, at the time an unknown but visionary news agency, came to the fore in less than a decade. So did Reuters, which reinvented itself in short order as a provider of content in digital multi-channels.

I regret to say that AT&T management was not any nimbler or more prescient in this respect than DJ&C was — in their view at the time, the key technical problem was how to make a cheap-enough piece of hardware combining a modem, a printer, and a cassette recorder, so that subscribers could download a personalized news feed in the wee hours of the morning (when bandwidth was essentially free), and have their choice of a printed or spoken version waiting at breakfast time.

They didn't understand that the most serious problems were editorial and human-factors problems: how to let users set up their profiles, how to match stories to their profiles cheaply and reliably enough (or let them search for stories conveniently enough), how to get the necessary stories read (or synthesized) at high enough quality for the audio version, etc. (At least, the people I dealt with at the time weren't interested in these questions.)

Anyhow, Bill understood that for things like this to work, all sorts of new search and retrieval and user-interface technology would need to be developed, and that in order to develop the methods, researchers would need large bodies of real text to work on. He got someone to send me tapes of three years of Dow Jones newswire before I left AT&T in 1990.

Later on, in 1992 or so when a DARPA project needed more text, I tried to reach Bill again. I believe that he had retired by then; and his successors were frankly horrified that he had handed out so much stuff on a handshake. I don't think there were even any records at Dow Jones that the release had happened — he'd just asked one of the computer operators in Princeton to make me a dump.

Anyhow, the guy in charge at that point, Peter Shuyten, was smart enough to recognize that we could probably be trusted, given that no IPR disasters had occurred up to that point, and was kind enough to decide to give us more text rather than taking us to court, although I think that this time, we paid for it, at least at the standard newswire subscription rates.

Computational linguistics owes Bill Dunn a lot, and (I think) so does the world at large. Thanks, Bill!


  1. Daniel C. Parmenter said,

    August 6, 2009 @ 1:09 pm

    Thanks for the information! I had never heard any of this before and it was quite interesting to read this history. So thank you too Bill Dunn!

    Working with that corpus was quite fun and I came away with a lot of respect for the WSJ writers. Moving from testing our parser with WSJ text to testing text that we found on the 'net was a rude awakening to say the least, but also a fun and challenging task. I miss my days in MT more than I can say.

  2. Nathan Myers said,

    August 6, 2009 @ 4:29 pm

    Did I miss where "MT" was expanded?

    [(myl) Apparently :-). It stands for "machine translation".]

  3. Daniel C. Parmenter said,

    August 6, 2009 @ 5:08 pm

    Yeah, see the previous post for the full context. But I didn't bother to explain what "WSJ" meant in that post, so I'm still guilty of insufficient expansion of acronyms.

  4. Nathan Myers said,

    August 6, 2009 @ 6:04 pm

    Sure enough, there it is. I looked there, but missed it. My first guess at what "machine translation" means would be "operating a forklift", but forklifts do rotations about Z, too. I suppose affine jokes are commonplace in the MT crowd.

  5. Bill Dunn said,

    October 15, 2009 @ 1:59 pm

    To Mark Lieberman:

    You are welcome!

    Bill Dunn

  6. Bill Dunn said,

    October 15, 2009 @ 2:00 pm

    To Daniel Parmenter:

    And you too are welcome.

    Bill Dunn

  7. Tom Dalessio said,

    January 2, 2010 @ 5:49 pm

    For more than thirty years I worked for Dow Jones, more specifically the Wall Street Journal, retiring in 2002. The most rewarding years with the company were spent working for Bill Dunn.

    Bill saw the increase in postal rates mounting in the early 1970’s and established a subsidiary of Dow Jones called National Delivery Service. This company was tasked to take the majority of the Journal’s 1 million subscriber copies out of the Postal Service and hand deliver them to subscribers. This accomplished two things. It improved service thus increasing circulation and provided savings in distribution costs.

    Bill was the president of National Delivery Service, and two others, Kurt Olson and Mike Marvaso, and I worked for him to set up and operate the company. Bill’s determination and guidance were fantastic. He was a true visionary who gave those working for him a sense of purpose and self-confidence that would last throughout their DJ careers. Bill knew how to work hard and how to play hard after each mission was successfully accomplished.

    Thank you, Bill Dunn. My years with NDS under your guidance were the most rewarding and enjoyable of any with Dow Jones, and it is a true shame that DJ did not follow your lead.

    Tom Dalessio

