Mrs. Transformer-XL Tittlemouse

This is another note on the amazing ability of modern AI learning techniques to imitate some aspects of natural-language patterning almost perfectly, while managing to miss common sense almost entirely. This probably tells us something about modern AI and also about language, though we probably won't understand what it's telling us until many years in the future.

Today's example comes from Zihang Dai et al., "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context", arXiv 6/2/2019.

The abstract:

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.
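The "segment-level recurrence mechanism" the abstract mentions can be sketched roughly as follows. This is a minimal single-head NumPy illustration, not the paper's implementation — the function names, shapes, and the absence of positional encodings are all simplifying assumptions. The idea it shows: each new segment attends over its own hidden states plus a cached copy of the previous segment's states, extending the effective context without reprocessing earlier text.

```python
import numpy as np

def attend(query_seg, key_value_ctx, w_q, w_k, w_v):
    """Single-head attention of the current segment over an extended context."""
    q = query_seg @ w_q
    k = key_value_ctx @ w_k
    v = key_value_ctx @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # row-wise softmax (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def transformer_xl_layer(segment, memory, w_q, w_k, w_v):
    """One layer with segment-level recurrence: the current segment's
    queries attend over [cached previous segment; current segment].
    In training, no gradients flow into the cached memory."""
    extended = np.concatenate([memory, segment], axis=0)  # [mem+seg, d]
    out = attend(segment, extended, w_q, w_k, w_v)
    new_memory = segment.copy()  # cache hidden states for the next segment
    return out, new_memory

rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
memory = np.zeros((4, d))      # empty memory before the first segment
for _ in range(3):             # process a long sequence segment by segment
    segment = rng.normal(size=(4, d))
    out, memory = transformer_xl_layer(segment, memory, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

A real implementation stacks many such layers, caches memory per layer, and uses the paper's relative positional encoding so that positions remain coherent across the segment boundary; none of that is shown here.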

The example, from Appendix E ("Generated Text"):

= The Tale of Mrs. Tittlemouse =
The Tale of Mrs. Tittlemouse is a children’s book written and illustrated by H. M. Potter. It was first published in February 1910 by Frederick Warne & Co. and included illustrations by Frederick Warne. Mrs. Tittlemouse was intended to be a children’s book, rather than a musical, and the tale describes her attempts to find her own home. Mrs. Tittlemouse is an average kitten and the tale follows a mischievous rabbit named Mr. Tittlemouse who tries to kill her in an attempt to get a ride on a horse. Potter later wrote of the rabbit and her attempts to kill her, " If [ she ] were a fox, I should like to have been able to show her the way. " Potter’s work was inspired by The Tale of Little Black Sambo and her two nieces, Petunia and Petunia. It was the first book in the Potter family to be published in a hardback format, and was also the first book in the Peter Rabbit series. The tale was released on 13 November 1910 and was accompanied by a greeting card by Potter’s brother-in-law, Frederick Warne. The tale was well received by critics and children, and sold over 600,000 copies in its first few years. In the first year of its publication, Potter’s The Tale of Mrs. Tittlemouse won children’s book awards in England, America and Australia. It was selected as a " recommended reading " by children in the US and Canada.
= = Plot summary = =
The tale opens with the pet rabbit Mrs. Tittlemouse wandering through a forest in a small litter of four. He is shot and badly injured by a deer after she attempts to escape. Mrs. Tittlemouse decides that she wants to find her own home, because she is eager to go on her own. She goes alone to the farm where she makes a little money by selling a few seeds and building a small cabin in the woods. She is approached by a wealthy hunter named Mr. Tittlemouse, who tries to kill her but Mrs. Tittlemouse kills him by stuffing a rope into his nose and killing him. She is rescued by Mr. Tittlemouse’s wife Ruth, but Mrs. Tittlemouse then leaves the woodland with the baby. When she is spotted by

The authors' explanation:

 500 tokens generated by XL using a snippet from the Wikitext-103 test set as initial context. The sample is randomly generated without any cherry picking. Original Wikipedia page: https://en.wikipedia.org/wiki/The_Tale_of_Mrs._Tittlemouse.

This example exhibits some additional interesting properties of Transformer-XL:

    • After finishing the last paragraph of the seed context, both the reference and generated text start a new topic (i.e., Wikipedia page), as marked by the single “= title =” line. This suggests the model has the ability of identifying the end of a topic / page, and randomly starting with a new topic.
    • Even more interestingly, a newly-started page is on a book called “The Tale of Mrs. Tittlemouse”. Transformer-XL manages to copy the same book title and some related information from the training set, but hallucinates novel content of the book. This demonstrates a degree of generalization instead of memorization.

If I've read the paper right ("WikiText-103 is the largest available word-level language modeling benchmark with long-term dependency"), their input tokens are wordforms rather than letters, which would explain the lack of pseudo-words like "Shelties" and "kakakew".
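The word-level vs. character-level point can be made concrete with a toy contrast. Whitespace splitting here is a stand-in assumption — the actual WikiText-103 corpus is pre-tokenized with a fixed word vocabulary — but it illustrates why a word-level model cannot emit pseudo-words: every sampled token must already exist in its vocabulary.

```python
# Toy contrast between word-level and character-level tokenization.
# Whitespace splitting stands in for the real WikiText tokenizer (assumption).
text = "The Tale of Mrs. Tittlemouse"

word_tokens = text.split()   # word-level: each token is a whole wordform
char_tokens = list(text)     # character-level: each token is one character

print(word_tokens)       # ['The', 'Tale', 'of', 'Mrs.', 'Tittlemouse']
print(len(char_tokens))  # 28

# A word-level model samples only from its fixed vocabulary, so it can
# never output a novel string; a character-level model could spell out
# an invented word like "kakakew" one letter at a time.
```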

[h/t Neville Ryant]

5 Comments

  1. Michael Watts said,

    December 19, 2019 @ 12:12 am

    If this is "reasonably coherent", something's wrong with the calibration of "reasonable".

    Mrs. Tittlemouse is an average kitten and the tale follows a mischievous rabbit named Mr. Tittlemouse who tries to kill her in an attempt to get a ride on a horse.

    = = Plot summary = =

    The tale opens with the pet rabbit Mrs. Tittlemouse wandering through a forest in a small litter of four. He is shot and badly injured by a deer after she attempts to escape.

  2. Kristian said,

    December 19, 2019 @ 9:03 am

    This is the best computer generated text I have ever read. It's hilarious. There are too many good lines to quote, but my favorite detail is the "two nieces, Petunia and Petunia". And "the tale was released on 13 November 1910 and was accompanied by a greeting card by Potter's brother-in-law, Frederick Warne".

  3. Bill Benzon said,

    December 19, 2019 @ 9:59 am

    "This probably tells us something about modern AI and also about language, though we probably won't understand what it's telling us until many years in the future."

    Yes.

    I've been thinking something like that for a couple of years now, and have even attempted to conceptualize it – a number of posts I've labeled with "AI Limit" are of this kind, particularly Computational linguistics & NLP and Borges Redux: Computing Babel. In his speech upon accepting an award from the ACL, Martin Kay (PDF) notes that contemporary AI is using statistics over word distributions as a substitute/proxy for a model of the world. I think that's right. I note as well that the problem of common sense knowledge is one of the problems that put the brakes on old-style symbolic AI. That, of course, is a problem about modeling the (surface of the) world in all its trivial but inescapable variety. There were just so many bits of it to hand-code and, once coded, all those trivial bits exacerbated the problem of combinatorial explosion.

    It is thus interesting that these new techniques, run on machines that dwarf those machines from the 1970s and 1980s, are now running into the common sense problem. It is not at all obvious to me that the problem can be solved by using ever more text as fodder and more computing power to digest that fodder. The world is just too big and too irreducibly complex to be mastered in that way.

    The other side of the issue is that these statistical techniques work very well in closed domains, like chess and Go. In those domains there is hardly any world to speak of and hence there is no common sense problem. Moreover, abstractly considered, those games are finite. Given enough time and memory it would be possible to calculate every possible game and then list them all. What's interesting is that the best chess programs seem to have broken into regions of the chess space that human players had not explored, so they exhibit new styles of play.

    It's as though Go and chess embody the abstract mental powers we bring to bear on the world (Chomskyian generativity?) while the common sense problem, in effect, represents the resistance that the world presents to us. It is the world exerting its existence by daring us: "parse this, and this, and this, and…!"

  4. Rick Rubenstein said,

    December 19, 2019 @ 7:55 pm

    Transformers have a potential of learning longer-term dependency.
    This prospect is frightening. Let's not make the Decepticons any more powerful than they already are.

  5. V said,

    December 27, 2019 @ 10:17 pm

    That seems superficially impressive, but also, as Bill Benzon said, just a difference in scale. Nothing we were not doing 15 years ago with regard to language, but with more processors. And very narrow, in specific domains inside language (or chess, or Go).
