Lexical loops


David Levary, Jean-Pierre Eckmann, Elisha Moses, and Tsvi Tlusty, "Loops and Self-Reference in the Construction of Dictionaries", Phys. Rev. X 2, 031018 (2012):

ABSTRACT: Dictionaries link a given word to a set of alternative words (the definition) which in turn point to further descendants. Iterating through definitions in this way, one typically finds that definitions loop back upon themselves. We demonstrate that such definitional loops are created in order to introduce new concepts into a language. In contrast to the expectations for a random lexical network, in graphs of the dictionary, meaningful loops are quite short, although they are often linked to form larger, strongly connected components. These components are found to represent distinct semantic ideas. This observation can be quantified by a singular value decomposition, which uncovers a set of conceptual relationships arising in the global structure of the dictionary. Finally, we use etymological data to show that elements of loops tend to be added to the English lexicon simultaneously and incorporate our results into a simple model for language evolution that falls within the “rich-get-richer” class of network growth.

I haven't read the paper yet, much less thought about it. So, more later.
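The loop-hunting the abstract describes can be sketched in a few lines: treat each headword as a node with edges to the words used in its definition, then extract the strongly connected components of that graph. Below is a minimal sketch on an invented seven-word mini-dictionary (not the paper's data), using Tarjan's algorithm.

```python
# Sketch: find definitional loops by treating a dictionary as a directed
# graph (headword -> words used in its definition) and extracting its
# strongly connected components. The mini-dictionary below is invented
# for illustration; it is not the paper's data.

toy_dict = {
    "big":    ["large"],
    "large":  ["big"],          # a 2-word loop: big <-> large
    "huge":   ["very", "big"],
    "very":   ["high", "degree"],
    "high":   ["degree"],
    "degree": ["amount"],
    "amount": ["degree"],       # another 2-word loop
}

def sccs(graph):
    """Tarjan's algorithm for strongly connected components."""
    index, low, on_stack, stack = {}, {}, set(), []
    out, counter = [], [0]

    def visit(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of a component
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            out.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return out

loops = [c for c in sccs(toy_dict) if len(c) > 1]
print(loops)  # the two short loops: {big, large} and {degree, amount}
```

Even on this toy example the paper's observation shows up: the loops are short (length 2), and the rest of the vocabulary hangs off them.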


  1. Core words : Leonardo Boiko’s background diary said,

    October 1, 2012 @ 3:14 pm

    […] Hierarchies in Dictionary Definition Space, via Loops and Self-Reference in the Construction of Dictionaries, via languagelog. […]

  2. leoboiko said,

    October 1, 2012 @ 3:24 pm

    From the paper, I reached Hierarchies in Dictionary Definition Space:

    We reduced dictionaries to their “grounding kernels” (GKs), about 10% of the dictionary, from which all the other words could be defined. […] one can compress still more: the GK turns out to have internal structure, with a strongly connected “kernel core” (KC) and a surrounding layer, from which a hierarchy of definitional distances can be derived, all the way out to the periphery of the full dictionary.

    It would be interesting to see various kernels/cores calculated from different languages/dictionaries. I wish there were software to generate cores for the various translations of Wiktionary.
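The "grounding kernel" construction quoted above can be approximated by a simple fixed-point process: repeatedly delete any word that no surviving definition uses, until nothing more can be removed. A minimal sketch on an invented toy lexicon follows; the actual procedure in the papers handles word senses and morphology with far more care.

```python
# Sketch: strip a toy dictionary down to a "grounding kernel" by
# repeatedly deleting headwords that no surviving definition uses.
# The toy lexicon is invented for illustration only.

toy = {
    "move":      ["change", "place"],
    "change":    ["make", "different"],
    "make":      ["cause", "change"],
    "cause":     ["make", "happen"],
    "happen":    ["take", "place"],
    "place":     ["space"],
    "space":     ["place"],
    "different": ["change"],
    "take":      ["move"],
    "walk":      ["move", "place"],   # peripheral: no definition uses "walk"
    "stroll":    ["walk"],            # even more peripheral
}

def grounding_kernel(d):
    kernel = dict(d)
    while True:
        used = {w for defn in kernel.values() for w in defn}
        unused = [w for w in kernel if w not in used]
        if not unused:
            return kernel     # fixed point: every word defines something
        for w in unused:
            del kernel[w]

print(sorted(grounding_kernel(toy)))
```

On this toy, "stroll" goes first, which makes "walk" unused in turn; the nine remaining words all participate in defining each other, which is the kernel-like core.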

  3. Theodore said,

    October 1, 2012 @ 3:35 pm

    Obligatory reminder: All Wikipedia articles eventually link to Philosophy. This is probably a kernel core of sorts. The rich get richer indeed.

  4. Q. Pheevr said,

    October 1, 2012 @ 4:06 pm

    I'm a bit worried by the leap in the abstract from English dictionaries, which are published texts, to the English lexicon, which is an object in the mind. The paper itself acknowledges that the two are not the same thing, and says that dictionaries "provide snapshot representations of the lexicon and as such provide an extremely useful model for studying the lexicon." Still, it's not clear to me how good that model is. Dictionaries necessarily define words in terms of other words; it's not obvious that meaning is represented in at all the same way in the mental lexicon. There are certainly links between words in the mind, which can be identified through psycholinguistic priming experiments, but these are not necessarily the same kinds of links that can be found in a dictionary, particularly because not all of them are based on meaning.

  5. AntC said,

    October 1, 2012 @ 7:54 pm

    I'm interested in the "rich-get-richer" angle, but the body of the paper doesn't really seem to expand on it.

    Would there be some sort of information entropy going on? The more near-synonyms a word has, the less specific is its meaning? (Or perhaps the inverse of that: the more specific a meaning, the more words are needed in the definition to get the right nuance? In such a case, does each word of the definition 'count for one'? Is each really an "alternative word"?)

    The paper discusses the problem of polysemy and the risk of semantic misinterpretation, but I'm not convinced they really get over it "… by considering only the first sense of a word in the event of polysemy." Is the first-given sense necessarily the most insight-yielding for these purposes?

  6. Lazar said,

    October 1, 2012 @ 8:05 pm

    I wonder if, given enough time and effort, it would be possible to learn a language from examining a monolingual dictionary?

  7. AntC said,

    October 1, 2012 @ 8:44 pm

    @Lazar: not possible. Knowing that the word "snow" means the same as "frozen crystalline water" does not tell you what either means. That's why the paper introduces the notion of definitional loops or clusters of words that appear at the same time etymologically.

    And they call this self-reference (cue Russell’s paradox and Gödel’s theorem); but I really find that gratuitous. A dictionary is not trying to construct meanings from some axiomatic base. If two words are near-synonyms, it's no surprise that one appears in the definition of t'other, and vice versa. It's the dictionary that's doing the referring, not the words themselves.

    A self-referential definition would be something like: ancestor = parent, or ancestor of parent.

  8. John Lawler said,

    October 1, 2012 @ 10:10 pm

    Another fun activity is to do this with a bilingual dictionary — look up a word in one language and then look up the definition word in the other language, then look that up in the first language, etc.

    It loops pretty frequently, in my experience. And it almost always leads to interesting semantics.
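The bilingual ping-pong described above is easy to mechanize: alternate lookups between the two directions of a glossary until some word repeats, which closes the loop. A sketch with invented toy glossaries (not real dictionary data):

```python
# Sketch: alternate lookups between two toy bilingual glossaries until
# a word repeats, closing a translation loop. The word lists below are
# invented illustrations, not real dictionary entries.

en_to_fr = {"house": "maison", "home": "foyer", "hearth": "foyer"}
fr_to_en = {"maison": "home", "foyer": "hearth"}

def trace(word, fwd, bwd):
    """Follow lookups, alternating direction, until a word recurs."""
    seen, path, table = set(), [word], fwd
    while path[-1] not in seen:
        seen.add(path[-1])
        path.append(table[path[-1]])      # look up in the current direction
        table = bwd if table is fwd else fwd   # switch direction
    return path

print(" -> ".join(trace("house", en_to_fr, fr_to_en)))
```

Starting from "house", the toy chain drifts through "maison", "home", and "hearth" before looping back to "foyer", which is the kind of semantic drift the looping tends to expose.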

  9. David Morris said,

    October 1, 2012 @ 10:44 pm

    I have just completed my masters degree through the University of New England, Australia. The linguists there are very into Natural Semantic Metalanguage, which posits that every language has a set of (currently 64) "semantic primes", by which everything else can be defined or translated into other languages. I think it's a useful tool, but was sometimes put off by the fervent zeal with which they would raise this at every opportunity.

  10. AntC said,

    October 1, 2012 @ 11:25 pm

    @David Morris "semantic primes" could be a parlour game: pick a number, adjust your set to fit:
    – earth, water, air, fire, (quintessence)
    – animal, vegetable, mineral, abstract
    – how many did Roget pick, and why?

    I think Yorick Wilks was playing this game for machine translation back in the 1970's: enough semantic markers to mediate between the source and target languages' polysemy.

    How many primes does it need to distinguish the fabled variety of words for snow?

    The whole dictionary-enterprise seems to me mired in arbitrariness (including the paper Mark references — why cut off at a cycle length <= 5?). I'm sure lexicographers have rules of thumb. Perhaps any metrics from 'Lexical loops' are just reflecting their rules?

  11. richard howland-bolton said,

    October 2, 2012 @ 6:26 am

    @ John. Some time ago (2003) I did a similar experiment for one of my silly radio essays. If I can quote the results:

    Years ago, there was, for example, in the early days of machine translation a famous incident (and probably, like most famous incidents, an apocryphal one too) that involved the Russians, the Americans and quite possibly the beginnings of détente.

    The received story is that the CIA set up a computer to enable them to quickly and automatically translate to and from Russian. Searching for a suitable phrase to test the machine they came up with “Out of sight out of mind” (that, to digress, surely says a lot about the shortcomings of the CIA). They fed this into the computer which duly spat out some Russian (exactly what that was isn’t recorded, which surely says a lot about the shortcomings of language instruction in the English Speaking world) which they then fed back in. Out came the perfectly well-formed English phrase “Invisible imbecile”.

    Now what set me thinking about this rather than gnawing on my feet was the fact that, in the international world of the internet, the people you deal with can be just about anywhere. And that a very nice guy in Finland had helped me with some software. I wanted to tell him how well it worked, and embarrassed by his excellent English, had a bright idea. For my e-mail reply I put the passage "It works! It works! Thank you very much. This is wonderful!" through an on-line translation engine to turn it to Finnish (it came out as something beginning “Se tehdas! Se tehdas! Kiittää te …!”) and then I added (in English) the hope it didn’t say anything bad in Finnish. Now with that thought in mind I immediately translated it back as a check (the only thing I’ve EVER had in common with the CIA). It came out as "It mill! It mill! Praise you very much. This is transcendental!" Well I couldn’t leave it there, could I, so I repeated the process, back and forth until it eventually settled down as: "Se jyrsiä! Se jyrsiä! Kehua te erittäin hyvin. Nyt kuluva on ihmeellinen!", which your Finniphone friends (just as soon as they stop rolling about on the floor at my pronunciation) will tell you is "It crop! It crop! Boast you very much. This is [still] transcendental!"

    As you’ve seen from my feet fetish, I’m not someone to leave things well alone and there I was with the translation software to hand: so I couldn’t help but repeat the “Out of sight out of mind” experiment. I tried various languages, most of which were, sadly, rather boring, nay even accurate. All of them ended in phrases which always translated the same way, back and forwards, (as you’d expect) except the French series which, of course, ended with “hopeless, nil, non-existent” which I think was less of a perceptive translation than the engine giving up in typical French disgust. I was about to give up myself when I tried Chinese.

    The Chinese series went like this (and by the way, I’ll only give the English side to avoid the sort of fiasco we had back there with Finnish):

    Out of sight out of mind;

    Stemming from sight outside brains;

    Source to sight outside brain;

    The origin sees exterior brain

  12. Kyle said,

    October 2, 2012 @ 8:14 am

    On the one hand this is a cool project that I have always thought would be interesting to do.

    On the other hand, there are a number of structural things off about this paper. It is published in Phys. Rev. X, which doesn't give a lot of confidence in the peer reviewers knowing much about the domain area. Time to publication was 24 days (!) which, while great (especially if that's the norm in statistical physics), tells me again that this was most likely not reviewed by linguists / language change specialists. It is represented as being about dictionaries, but they used WordNet, which is very much not a dictionary (it may share some properties, but the "looping" behavior I expect would be extremely different, as would many other properties they discuss). And finally, their citations into the literature on lexical representation and language change are minimal at best. This last point makes it very difficult to align their assumptions to what I usually take for granted about concept / word mappings and so on.

    For example, "We thus associate the introduction of a new concept into language at a given time with the appearance of at least one word at that time that was not definable at earlier times." It isn't clear that not having a word in the language means there is no concept. But setting that aside, they are only using synchronic data, as far as I understand it, and thus can't tell what was or wasn't definable in earlier stages. Moreover, their notion of "definable" isn't really one I can easily understand. This somewhat reminds me of traditional claims about tenseless languages that the speakers couldn't understand time, etc. But of course they can — these languages have many strategies for characterizing the temporal relations that could in other languages be accomplished with tense, once you go looking for them, and there are many ways of characterizing roughly the same meanings. There are strange assumptions like this all over the place; "…reflects our basic intuition that new concepts must be self-contained…" Why would this be true? And why would their intuition matter?

    It's just hard for me to believe that this paper wouldn't have changed radically if it went through a peer review process with domain experts involved.

  13. David L said,

    October 2, 2012 @ 8:32 am

    @Kyle: the paper was received 3 Feb 2012, published 27 Sept, according to the abstract. That's a bit more than 24 days…

  14. Acilius said,

    October 2, 2012 @ 8:40 am

    The abstract reminds me of another physicist, Georg Christoph Lichtenberg, who has an aphorism about definitions somewhere in his notebooks. I don't have an edition of Lichtenberg's notebooks with me at the moment, but if memory serves it is something to the effect that a definition represents an attempt to arrange familiar words into a ladder that we may climb in order to reach understanding of an unfamiliar concept. That notion has always intrigued me; to the extent that it is an adequate description of the function of definitions, it would imply that there are some words we will never be able to define, as they are attached to concepts we must grasp before we are familiar with any words. The word "word," for example. Be that as it may, the paper sounds interesting. Thanks for the link!

  15. M (was L) said,

    October 2, 2012 @ 9:38 am

    If you play the same "synonym tracing" game with a thesaurus, which I suspect is a handier model for the purpose, you'll find that it's very often possible to trace a word to its direct antonym, by tracing synonyms. Rather, I should say near-synonyms; by choosing at each stage a shading of meaning just a little further along the spectrum, you will usually get there.

    In the extreme you could conclude that (a) the thesaurus is useless, if it eventually equates "up" with "down" – or (b) that most words cover some blurry semantic zone that overlaps others, and are not a set of sharply-defined discrete and distinct meanings. It's (b) of course.

    Something might be concluded from the "distance" between a given word and the antonym. However, what exactly IS the antonym of a blurry-edged term? Perhaps whatever the thesaurus lists as an antonym – or rather, any of them.

    I assume that something like this has been done, half-a-dozen different ways at least. Does anybody know of such work?
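The synonym-to-antonym game described above is, in graph terms, a shortest-path search: breadth-first search over near-synonym links from a word to its antonym. A minimal sketch on an invented toy graph (a real experiment would use actual thesaurus data):

```python
# Sketch: breadth-first search for the shortest chain of near-synonym
# links from a word to its antonym. The toy near-synonym graph below is
# invented for illustration; it is not real thesaurus data.
from collections import deque

near_syn = {
    "up":      ["raised", "high"],
    "raised":  ["elevated", "up"],
    "high":    ["tall", "up", "intense"],
    "intense": ["deep", "high"],
    "deep":    ["low", "intense"],
    "low":     ["down", "deep"],
    "down":    ["low"],
}

def chain(start, goal, graph):
    """Shortest near-synonym chain from start to goal, or None."""
    prev, frontier = {start: None}, deque([start])
    while frontier:
        w = frontier.popleft()
        if w == goal:                 # rebuild the path by walking back
            path = []
            while w is not None:
                path.append(w); w = prev[w]
            return path[::-1]
        for n in graph.get(w, []):
            if n not in prev:
                prev[n] = w
                frontier.append(n)
    return None

print(" -> ".join(chain("up", "down", near_syn)))
```

Each hop shades the meaning slightly ("high" to "intense" to "deep"), which is exactly the blurry-overlap effect the comment describes: no single link equates "up" with "down", but the chain gets there.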

  16. Kyle said,

    October 2, 2012 @ 10:52 am

    @David L You are right, sorry, I somehow read "sep" as "feb".

    I want to temper my negative comments a little (now that I've had some coffee and woken up). I do think what this paper is doing is very interesting and fairly unique, and the actual techniques seem super interesting / promising. But the lack of engagement with research in the empirical domain of the paper is going to make it hard for people in (what I assume must be) the target audience to make any headway in getting these results somewhere that is meaningful to them.

  17. Jerry Friedman said,

    October 2, 2012 @ 12:26 pm

    For fans of back-and-forth translating, there's Bad Translator. Another site, Translation Party, isn't working for me.

  18. Peter Taylor said,

    October 3, 2012 @ 2:28 am

    @John Lawler, when I started doing Spanish crosswords I used a dictionary reflection like that as a poor man's thesaurus.

  19. mollymooly said,

    October 3, 2012 @ 6:15 am

    The Longman Dictionary of Contemporary English is an ESL dictionary that claims to define "207,000 words, phrases and meanings" using "only 2000 common words". So you could just look for loops within those 2000.

  20. M (was L) said,

    October 3, 2012 @ 7:33 am

    I'm just wondering, if we limit the scope of the problem to a set as small as 2000 words, then is every word within six steps of "bacon"?

  21. AntC said,

    October 3, 2012 @ 6:22 pm

    Seems myl has an entry in the "parlour game": Towards a "Universal Dictionary" for Multi-Language Information Retrieval Applications http://www.ldc.upenn.edu/myl/may01.pdf
    "… about ten thousand inflected forms or about 7500 lemmas …" [And he will quite correctly castigate me for ripping that out of context.]

  22. Mark Changizi said,

    October 5, 2012 @ 2:15 pm

    For those interested in this topic, you may also like this: http://changizi.com/dictionary.pdf
