McCrum's 100 best ways to ruin the 4th of July

The many Americans in the University of Edinburgh's community of language and information scientists had to celebrate the glorious 4th on the 3rd this year, because the 4th is an ordinary working Monday. I attended a Sunday-afternoon gathering kindly hosted by the Head of the School of Informatics, Johanna Moore. We barbecued steadfastly in the drizzle despite classic Scottish indecisive summer weather: it was cloudy, well under 60°F. Twice we all had to flee indoors when the rain became heavier. No matter: we chatted together and enjoyed ourselves. (I swore in 2007 that one thing I was not going to do was spend my time in this bracing intellectual environment grumbling about how the weather in Santa Cruz had been better. I'm here for the linguistic science, not the weather.) So it was a happy Fourth of July for me. Until this morning, the actual 4th, when people started emailing me (thanks, you sadistic bastards) to note that Robert McCrum had chosen America's independence day to make his choice for the 23rd in a series called "The 100 Best Nonfiction Books of All Time," in the British newspaper The Observer. He chooses The Elements of Style by William Strunk and E. B. White. For crying out loud!



Spelling with Chinese character(istic)s, pt. 4

The last installment of this series, "Spelling with Chinese character(istic)s, pt. 3" (6/30/16), contains links to many other Language Log posts relevant to this subject.

It is often difficult to fathom which English word is intended when it is transcribed in Chinese characters.  John Kieschnick called my attention to an especially challenging one:  ěrlílìjǐng 爾釐利景.  Before going on to the next page and before googling it, try to figure out what it is meant to "spell".  Scout's honor!  No peeking!



Ex-physicist takes on Heavy Metal NLP

"Heavy Metal and Natural Language Processing – Part 1", Degenerate State 4/20/2016:

Natural language is ubiquitous. It is all around us, and the rate at which it is produced in written, stored form is only increasing. It is also quite unlike any sort of data I have worked with before.

Natural language is made up of sequences of discrete characters arranged into hierarchical groupings: words, sentences and documents, each with both syntactic structure and semantic meaning.

Not only is the space of possible strings huge, but the interpretation of a small section of a document can take on vastly different meanings depending on what context surrounds it.

This variation and versatility are the reason that natural language is so powerful as a way to communicate and share ideas.

In the face of this complexity, it is not surprising that understanding natural language, in the same way humans do, with computers is still an unsolved problem. That said, there are an increasing number of techniques that have been developed to provide some insight into natural language. They tend to start by making simplifying assumptions about the data, and then using these assumptions to convert the raw text into a more quantitative structure, like vectors or graphs. Once in this form, statistical or machine learning approaches can be leveraged to solve a whole range of problems.

I haven't had much experience playing with natural language, so I decided to try out a few techniques on a dataset I scraped from the internet: a set of heavy metal lyrics (and associated genres).
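For readers who want to see what "convert the raw text into a more quantitative structure" can look like in practice, here is a minimal sketch of one common simplifying assumption, the bag-of-words model, which ignores word order and keeps only word counts. This is an illustration only, not the method used in the Degenerate State post; the function names `bag_of_words` and `cosine_similarity` and the sample lines are made up for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, pull out alphabetic tokens, and count them.

    Word order is thrown away: this is the simplifying assumption.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors stored as Counters."""
    shared = a.keys() & b.keys()
    dot = sum(a[word] * b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample lines, invented for this sketch.
first = bag_of_words("Darkness falls and the night is cold")
second = bag_of_words("The night is dark and the wind is cold")
print(cosine_similarity(first, second))
```

Once each text is a vector of counts, ordinary statistical tools apply: texts can be compared, clustered, or fed to a classifier, at the price of losing everything that depends on word order.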

[h/t Chris Callison-Burch]



A new polysyllabic character



"Linguistics has evolved"

From alice-is-thinking on tumblr, three weeks ago, forwarded by a 20-year-old correspondent:

https://www.tumblr.com/alice-is-thinking/145533947099/me-10-years-ago-i-never-use-online-abbreviations

The accompanying note:

this seems to be a rly common phenomenon among millennials who are especially active on social media – myself included



Sino-Japanese

I recall that, as a graduate student in Sinology, one of the most troublesome tasks was figuring out how to romanize the names of Japanese authors, the titles of their works, place names, technical terms, and so forth. Overall, Japanese Sinological (not to mention Indological and other fields) scholarship is outstanding, so we have to consult it, and when we cite Japanese works, we need to be able to romanize names, titles, and so forth to reflect their Japanese pronunciations.



Post Office nerdview (capped)

Postal orders are a way for people in Britain to send money by post without having a checking account, but there is a fee, dependent on the face value of the order. For a postal order with a face value of more than £100 the fee is shown on the Post Office web page as "Capped at £12.50", which puzzled Matt Keefe. He wrote to me to ask if it was an instance of nerdview. Absolutely; that's exactly what it is.



Corpus-based judicial opinions

Gordon Smith, "Michigan Supreme Court Embraces Corpus Linguistics", The Conglomerate 6/28/2016:

In the case of People v. Harris, the Michigan Supreme Court became the first state supreme court in the United States to embrace corpus linguistics. (I have written here about Justice Thomas Lee's concurrence in the Utah Supreme Court's Rasabout case, which is cited in this Michigan opinion.) The consolidated cases relate to the "Disclosures by Law Enforcement Officers Act" (DLEOA), which bars the use in a subsequent criminal proceeding of all "information" provided by a law enforcement officer under threat of any employment sanction. While the act does not distinguish between true and false statements, the court used corpus analysis to investigate whether "information" must be true. The majority concludes, "false or inaccurate information cannot be used against a law enforcement officer in subsequent criminal proceedings. To hold otherwise would defeat the Legislature's stated intent…."



Hillary's "sigh"

Eric Garland of The Hill shares a video of Hillary Clinton at a June 22 campaign appearance in North Carolina, and it provides ammunition for those who would like to portray her as a soulless automaton vainly trying to seem like an authentic human being.

https://vine.co/v/5zdrHezXlbV



That false and senseless Way of Speaking

Some eloquent 17th-century Quaker peeving, from The history of the life of Thomas Ellwood: Or, an account of his birth, education, &c. with divers observations on his life and manners when a youth: … Also several other remarkable passages and occurrences. Written by his own hand. To which is added, a supplement by J. W., 1714:

Again, The Corrupt and Unsound Form of Speaking in the Plural Number to a Single Person (Y O U to One, instead of T H O U ;) contrary to the Pure, Plain and Single Language of T R U T H (T H O U to One, and Y O U to more than One) which had always been used, by G O D to Men, and Men to G O D, as well as one to another, from the oldest Record of Time, till Corrupt Men, for Corrupt Ends, in later and Corrupt Times, to Flatter, Fawn, and work upon the Corrupt Nature in Men, brought in that false and senseless Way of Speaking, Y O U to One ; which hath since corrupted the Modern Languages, and hath greatly debased the Spirits, and depraved the Manners of Men. This Evil Custom I had been as forward in as others and this I was now called out of, and required to cease from.



Sayable and now writable

In a comment to "Pinyin literature contest" (6/30/16), DG asked an excellent, reasonable question:

I am not a Chinese speaker, so I am wondering if the requirement that it's not originally written in Chinese characters is a sort of honor code, or is there some way to tell from the pinyin submission?



Lapsus linguae

Yesterday I gave a lecture on the Bronze Age and Early Iron Age "mummies" (they're really desiccated corpses, but "mummies" sounds cuter) of Eastern Central Asia before an audience of about twenty-five at the Franklin Inn Club in Philadelphia.



Private probably

The following two images come from Graham and Kathleen's video diary of a trip to the Daitoku-ji temple complex in Kyoto.

