Language Log

Archive for Computational linguistics

Electric sheep

April 18, 2017 @ 4:16 am· Filed by Mark Liberman under Computational linguistics, Elephant semifics

A couple of recent LLOG posts ("What a tangled web they weave", "A long short-term memory of Gertrude Stein") have illustrated the strange and amusing results that Google's current machine translation system can produce when fed variable numbers of repetitions of meaningless letter sequences in non-Latin orthographic systems. [Update: And see posts in the elephant semifics category for many other examples.] Geoff Pullum has urged me to explain how and why this sort of thing happens:

I think Language Log readers deserve a more careful account, preferably from your pen, of how this sort of craziness can arise from deep neural-net machine translation systems. […]

Ordinary people imagine (wrongly) that Google Translate is approximating the process we call translation. They think that the errors it makes are comparable to a human translator getting the wrong word (or the wrong sense) from a dictionary, or mistaking one syntactic construction for another, or missing an idiom, and thus making a well-intentioned but erroneous translation. The phenomena you have discussed reveal that something wildly, disastrously different is going on.

Something nonlinear: 18 consecutive repetitions of a two-character Thai sequence produce "This is how it is supposed to be", and so do 19, 20, 21, 22, 23, and 24, and then 25 repetitions produces something different, and 26 something different again, and so on. What will come out in response to a given input seems informally to be unpredictable (and I'll bet it is recursively unsolvable, too; it's highly reminiscent of Emil Post's famous tag system where 0..X is replaced by X00 and 1..X is replaced by X1101, iteratively).

Type "La plume de ma tante est sur la table" into Google Translate and ask for an English translation, and you get something that might incline you, if asked whether you would agree to ride in a self-driving car programmed by the same people, to say yes. But look at the weird shit that comes from inputting Asian language repeated syllable sequences and you not only wouldn't get in the car, you wouldn't want to be in a parking lot where it was driving around on a test run. It's the difference between what might look like a technology nearly ready for prime time and the chaotic behavior of an engineering abortion that should strike fear into the hearts of any rational human.

Language Log needs at least a sketch of a proper serious account of what's going on here.
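For readers who haven't met it, the Post tag system mentioned in the note above is easy to sketch in Python. This is a minimal illustration, assuming the usual reading of the rule: look at the first symbol, delete the first three symbols, and append 00 if that first symbol was 0 or 1101 if it was 1. The function names and the step limit are hypothetical choices made for this sketch, and nothing here models how a translation system behaves.

```python
def post_tag_step(word: str) -> str:
    """Apply one step of the tag system to a string of 0s and 1s.

    The first symbol selects the suffix; the first three symbols are
    then deleted and the suffix is appended to what remains.
    """
    suffix = "00" if word[0] == "0" else "1101"
    return word[3:] + suffix


def run_post_tag(word: str, max_steps: int = 1000) -> list[str]:
    """Iterate the tag system, returning every word seen along the way.

    Stops when the word has fewer than three symbols (nothing left to
    delete) or when max_steps is reached, since whether an arbitrary
    starting word ever halts is not known in general.
    """
    history = [word]
    for _ in range(max_steps):
        if len(word) < 3:
            break
        word = post_tag_step(word)
        history.append(word)
    return history


if __name__ == "__main__":
    for w in run_post_tag("10010", max_steps=10):
        print(w)
```

The step limit is there because small changes in the starting word can lead to very different trajectories, which is the informal sense of "unpredictable" at issue.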

A sketch is all that I have time for today, but here goes…

A long short-term memory of Gertrude Stein

April 16, 2017 @ 3:07 pm· Filed by Mark Liberman under Computational linguistics, Elephant semifics, Language and culture

As just observed ("What a tangled web they weave"), successive repetitions of short sequences of Japanese, Korean, Thai (and perhaps other types of) characters cause Google's Neural Machine Translation system to generate surprisingly varied and poetic English equivalents.

Thus if we repeat 1 through 25 times the two-character Thai sequence ไๅ

|ไ| 0x0E44 "THAI CHARACTER SARA AI MAIMALAI"
|ๅ| 0x0E45 "THAI CHARACTER LAKKHANGYAO"

the system, "a deep LSTM network with 8 encoder and 8 decoder layers using attention, residual connections, and trans-temporal chthonic affinity", establishes a pretty solid spiritual connection with Gertrude Stein:

What a tangled web they weave

April 15, 2017 @ 11:04 pm· Filed by Mark Liberman under Computational linguistics, Elephant semifics, Humor

…when neural nets are recursive:

Country list translation oddity

April 10, 2017 @ 5:04 pm· Filed by Mark Liberman under Computational linguistics, Elephant semifics, WTF

This is weird, and even slightly creepy — paste a list of countries like

Costa Rica, Argentina, Belgium, Bulgaria, Canada, Chile, Colombia, Dominican Republic, Ecuador, El Salvador, Ethiopia, France, Germany, England, Guatemala, Honduras, Italy, Israel, Mexico, New Zealand, Nicaragua, Peru, Puerto Rico, Scotland, Switzerland, Spain, Sweden, Uruguay, Venezuela, USA

into Google Translate English-to-Spanish, and a parallel-universe list emerges:

Advances in birdsong modeling

April 1, 2017 @ 6:57 am· Filed by Mark Liberman under Biology of language, Computational linguistics

Eve Armstrong and Henry Abarbanel, "Model of the songbird nucleus HVC as a network of central pattern generators", Journal of neurophysiology, 2016:

We propose a functional architecture of the adult songbird nucleus HVC in which the core element is a "functional syllable unit" (FSU). In this model, HVC is organized into FSUs, each of which provides the basis for the production of one syllable in vocalization. Within each FSU, the inhibitory neuron population takes one of two operational states: (A) simultaneous firing wherein all inhibitory neurons fire simultaneously, and (B) competitive firing of the inhibitory neurons. Switching between these basic modes of activity is accomplished via changes in the synaptic strengths among the inhibitory neurons. The inhibitory neurons connect to excitatory projection neurons such that during state (A) the activity of projection neurons is suppressed, while during state (B) patterns of sequential firing of projection neurons can occur. The latter state is stabilized by feedback from the projection to the inhibitory neurons. Song composition for specific species is distinguished by the manner in which different FSUs are functionally connected to each other.

Ours is a computational model built with biophysically based neurons. We illustrate that many observations of HVC activity are explained by the dynamics of the proposed population of FSUs, and we identify aspects of the model that are currently testable experimentally. In addition, and standing apart from the core features of an FSU, we propose that the transition between modes may be governed by the biophysical mechanism of neuromodulation.

"Bare-handed speech synthesis"

March 22, 2017 @ 6:38 pm· Filed by Mark Liberman under Computational linguistics

This is neat: "Pink Trombone", by Neil Thapen.

By the same author — doodal:

Court fight over Oxford commas and asyndetic lists

March 19, 2017 @ 12:28 am· Filed by Jason Eisner under Computational linguistics, coordination, Grammar, Language and the law, Linguistics in the news, Parsing, Punctuation, Usage

Language Log often weighs in when courts try to nail down the meaning of a statute. Laws are written in natural language—though one might long, by formalization, to end the thousand natural ambiguities that text is heir to—and thus judges are forced to play linguist.

Happily, this week's "case in the news" is one where the lawyers managed to identify several relevant considerations and bring them to the judges for weighing.

Most news outlets reported the case as being about the Oxford comma (or serial comma)—the optional comma just before the end of a list. Here, for example, is the New York Times:

Lack of Oxford Comma Could Cost Maine Company Millions in Overtime Dispute (news article)
For Want of a Comma (opinion piece)

What's hot at ICASSP

March 9, 2017 @ 3:40 pm· Filed by Mark Liberman under Computational linguistics

This week I'm at IEEE ICASSP 2017 in New Orleans — that's the "Institute of Electrical and Electronics Engineers International Conference on Acoustics, Speech and Signal Processing", pronounced /aɪ 'trɪ.pl i 'aɪ.kæsp/. I've had joint papers at all the ICASSP conferences since 2010, though I'm not sure that I've attended all of them.

This year the conference distributed its proceedings on a nifty little guitar-shaped USB key, which I promptly copied to my laptop for easier access. I seem to have deleted my local copies of most of the previous proceedings, but ICASSP 2014 escaped the reaper, so I decided to while away the time during one of the many parallel sessions here by running all the .pdfs (1703 in 2014, 1316 this year) through pdftotext, removing the REFERENCE sections, tokenizing the result, removing (some of the) unwordlike strings, and creating overall lexical histograms for comparison. The result is about 5 million words for 2014 and about 3.9 million words this year.

And to compare the lists, I used the usual "weighted log-odds-ratio, informative Dirichlet prior" method, as described for example in "The most Trumpish (and Bushish) words", 9/5/2015.
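For concreteness, here is a minimal Python sketch of one common formulation of that comparison. It assumes the two lexical histograms are already available as word-to-count mappings, and it uses the pooled counts of the two corpora as the Dirichlet prior. That choice of prior, the `prior_scale` parameter, and the function name are assumptions of this sketch, not a record of the scripts actually used.

```python
import math
from collections import Counter


def weighted_log_odds(counts_a, counts_b, prior_scale=1.0):
    """Return a z-score per word; positive means more typical of corpus A.

    counts_a, counts_b: mappings from word to raw count.
    The prior pseudo-count for each word is its pooled count in A and B,
    multiplied by prior_scale (an assumption made for this sketch).
    """
    pooled = Counter(counts_a) + Counter(counts_b)
    prior = {w: c * prior_scale for w, c in pooled.items()}
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    a0 = sum(prior.values())

    scores = {}
    for w, a_w in prior.items():
        y_a = counts_a.get(w, 0)
        y_b = counts_b.get(w, 0)
        # Log-odds of the word in each corpus, smoothed by the prior.
        log_odds_a = math.log((y_a + a_w) / (n_a + a0 - y_a - a_w))
        log_odds_b = math.log((y_b + a_w) / (n_b + a0 - y_b - a_w))
        delta = log_odds_a - log_odds_b
        # Approximate variance of the difference, used as the weight.
        variance = 1.0 / (y_a + a_w) + 1.0 / (y_b + a_w)
        scores[w] = delta / math.sqrt(variance)
    return scores


if __name__ == "__main__":
    # Toy counts, invented purely to exercise the function.
    toy_a = {"gmm": 30, "the": 100, "deep": 5}
    toy_b = {"gmm": 5, "the": 100, "deep": 40}
    ranked = sorted(weighted_log_odds(toy_a, toy_b).items(), key=lambda kv: kv[1])
    for word, z in ranked:
        print(f"{word}\t{z:+.2f}")
```

Dividing by the estimated standard deviation is what keeps rare words from dominating the ranking, since a large log-odds difference based on a handful of tokens gets a correspondingly small z-score.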

The shape of a LibriVox phrase

March 5, 2017 @ 5:53 am· Filed by Mark Liberman under Computational linguistics

Here's what you get if you align 11 million words of English-language audiobooks with the associated texts, divide it all into phrases by breaking at silent pauses greater than 150 milliseconds, and average the word durations by position in phrases of lengths from one word to fifteen words:

The audiobook sample in this case comes from LibriSpeech (see Vassil Panayotov et al., "Librispeech: An ASR corpus based on public domain audio books", IEEE ICASSP 2015). Neville Ryant and I have been collecting and analyzing a variety of large-scale speech datasets (see e.g. "Large-scale analysis of Spanish /s/-lenition using audiobooks", ICA 2016; "Automatic Analysis of Phonetic Speech Style Dimensions", Interspeech 2016), and as part of that process, we've refactored and realigned the LibriSpeech sample, resulting in 5,832 English-language audiobook chapters from 2,484 readers, comprising 11,152,378 words of text and about 1,571 hours of audio. (This is a small percentage of the English-language data available from LibriVox, which is somewhere north of 50,000 hours of English audiobook at present.)
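The phrase-level averaging described above can be sketched roughly as follows. This assumes the forced alignment has already produced, for a single recording, a time-ordered list of `(word, start, end)` tuples in seconds. The tuple layout and function names are hypothetical, and a real pipeline would also have to deal with chapter boundaries and alignment errors.

```python
from collections import defaultdict


def split_phrases(words, min_pause=0.150):
    """Break a time-ordered word list at silent gaps longer than min_pause.

    words: list of (word, start, end) tuples, times in seconds.
    Returns a list of phrases, each a list of the same tuples.
    """
    phrases, current = [], []
    for item in words:
        # Gap between the previous word's end and this word's start.
        if current and item[1] - current[-1][2] > min_pause:
            phrases.append(current)
            current = []
        current.append(item)
    if current:
        phrases.append(current)
    return phrases


def mean_duration_by_position(phrases, max_len=15):
    """Average word duration keyed by (phrase length, 1-based position)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of durations, count]
    for phrase in phrases:
        n = len(phrase)
        if n > max_len:
            continue  # only phrases of one to max_len words are tabulated
        for pos, (_, start, end) in enumerate(phrase, start=1):
            cell = totals[(n, pos)]
            cell[0] += end - start
            cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


if __name__ == "__main__":
    # Invented timings, just to show the shape of the input and output.
    toy = [("so", 0.00, 0.20), ("then", 0.25, 0.50), ("stop", 1.00, 1.30)]
    print(mean_duration_by_position(split_phrases(toy)))
```

Keying the table by both phrase length and position makes it possible to plot one curve per phrase length.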

Quantifying Donald Trump's rhetoric

February 7, 2017 @ 2:50 pm· Filed by Mark Liberman under Computational linguistics, Rhetoric

David Beaver & Jason Stanley, "Unlike all previous U.S. presidents, Trump almost never mentions democratic ideals", Washington Post 2/7/2017:

The central norms of liberal democratic societies are liberty, justice, truth, public goods and tolerance. To our knowledge, no one has proposed a metric by which to judge a politician’s commitment to these democratic ideals.

A direct way suggested itself to us: Why not simply add up the number of times those words and their synonyms are deployed? If the database is large enough, this should provide a rough measure of a politician’s commitment to these ideals. How does Trump’s use of these words compare to that of his presidential predecessors?

At Language Log, the linguist Mark Liberman graphed how unusual Trump’s inaugural speech was, graphing the frequency of critical words used in each of the past 50 years’ inaugural speeches — and showing how much more nationalist language, and how much less democratic language Trump used than did his predecessors.

We expanded this project, looking at the language in Trump’s inaugural address as well as in 61 campaign speeches since 2015. We compared that to the language used in all 57 prior inaugural speeches, from George Washington’s on. The comparison gives us a picture of Trump’s rhetorical emphases since his campaign began, and hence of his most deeply held political ideals.

"Finding a voice"

January 5, 2017 @ 11:54 am· Filed by Mark Liberman under Computational linguistics

An excellent article by Lane Greene: "Language: Finding a voice", The Economist 1/5/2017.

Twitter-based word mapper is your new favorite toy

December 15, 2016 @ 10:06 pm· Filed by Ben Zimmer under Awesomeness, Computational linguistics, Dialects, Variation

At the beginning of 2016, Jack Grieve shared the first iteration of the Word Mapper app he had developed with Andrea Nini and Diansheng Guo, which let users map the relative frequencies of the 10,000 most common words in a big Twitter-based corpus covering the contiguous United States. (See: "Geolexicography," "Totally Word Mapper.") Now as the year comes to a close, Quartz is hosting a bigger, better version of the app, now including 97,246 words (all occurring at least 500 times in the corpus). It's appropriately dubbed "The great American word mapper," and it's hella fun (or wicked fun, if you prefer).

Some misc word maps made with the word mapper https://t.co/R2wCfnsegx pic.twitter.com/6jiDdJN5QL

— nikhil sonnad (@nkl) December 15, 2016

"The people that stayed back the facts"

December 3, 2016 @ 6:07 pm· Filed by Mark Liberman under Computational linguistics

This is a reality check on the current state of automatic speech recognition (ASR) algorithms. I took the 186-word passage by Scottie Nell Hughes discussed in yesterday's post, and submitted it to two different Big-Company ASR interfaces, with amusing results. I'll be interested to see whether other systems can do better.
