Contextualized Muppet Embeddings


Over the past few years, it's been increasingly common for computational linguists to use various kinds of "word embeddings".

The foundation for this was the vector space model, developed in the 1960s for document retrieval applications, which represents a piece of text as a vector of word (or "term") counts. The next step was latent semantic analysis, developed in the 1980s, which orthogonalizes the term-by-document matrix (via singular value decomposition) and retains only a few hundred of the most important dimensions. Among other benefits, this provides a sort of "soft thesaurus", since words that tend to co-occur will be relatively close in the resulting space. Then in the 2000s came a wide variety of other ways of turning large text collections into vector-space dictionaries, representing each word as a vector of numbers derived in some way from the contexts in which it occurs — some widely used examples from the 2010s include word2vec and GloVe ("Global Vectors for Word Representation").
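For readers who want to see the mechanics, here is a minimal LSA sketch in Python (my illustration, not part of the original post; the toy corpus, the number of retained dimensions, and the scikit-learn/NumPy calls are all assumed choices): build the term-by-document count matrix, take a truncated SVD, and check that words which co-occur in similar documents come out closer in the reduced space than words which don't.

```python
# A toy latent-semantic-analysis sketch: term-by-document counts,
# truncated SVD, and cosine similarity in the reduced space.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "a dog sat on the rug",
    "stock prices fell on the market",
    "the market rallied as prices rose",
]

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray().T          # term-by-document count matrix
terms = vec.get_feature_names_out()

# Truncated SVD: keep only the k strongest dimensions.
k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * S[:k]                  # one k-dimensional vector per term

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

idx = {t: n for n, t in enumerate(terms)}
# Words that occur in similar documents end up close together:
# "cat"/"dog" should score higher than "cat"/"market".
print(cosine(word_vectors[idx["cat"]], word_vectors[idx["dog"]]))
print(cosine(word_vectors[idx["cat"]], word_vectors[idx["market"]]))
```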

The latest trend is for "contextualized" word representations, in which each word is represented by an array of numbers that depends not only on its distribution in training texts, but also on its context in the particular case being analyzed. Three examples emerged in 2018: ELMo ("Embeddings from Language Models"), ULMFiT ("Universal Language Model Fine-Tuning"), and BERT ("Bidirectional Encoder Representations from Transformers").
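To make "contextualized" concrete, here is a short sketch using the Hugging Face transformers library and a BERT checkpoint (again my illustration, not anything from the post; the sentences and model name are assumed): the same word "bank" gets a different vector in a river sentence and a money sentence, because the vector is computed from the whole sentence rather than looked up in a static table.

```python
# The same word type, two different context-dependent vectors.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(word, sentence):
    """Return the final-layer hidden state for `word` in `sentence`."""
    inputs = tok(sentence, return_tensors="pt")
    tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    return hidden[tokens.index(word)]

v_river = vector_for("bank", "we sat on the bank of the river")
v_money = vector_for("bank", "she deposited the check at the bank")

# A static embedding (word2vec, GloVe) would give identical vectors here;
# the contextualized vectors differ, so their cosine similarity is below 1.
print(torch.nn.functional.cosine_similarity(v_river, v_money, dim=0).item())
```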

The emerging pattern is obvious. There will be more contextualized word-embedding methods — and soon we can expect to see the acronyms ERNiE, GRoVEr, KERMiT, …

Update — Yuval Pinter was way ahead of me…

11 Comments

  1. Yuval Pinter said,

    February 13, 2019 @ 9:11 am

    (Can we embed tweets in comments?)

    Several New Ultimate Feature Finders Letting Embeddings Use Procedurally Acquired Global Universal Structure https://t.co/NRc3qPFAKP — Yuval Pinter (@yuvalpi) October 12, 2018

  2. Philip Taylor said,

    February 13, 2019 @ 11:27 am

    I don't know about anyone else, but for me the capitalisation of "Several New Ultimate Feature Finders Letting Embeddings Use Procedurally Acquired Global Universal Structure" makes it completely impossible to decipher. Of course, it might not convey much more (to me) even if conventionally capitalised, but at least conventional capitalisation would encourage me to work at it, whereas as capitalised now I find it a complete off-put.

  3. Jen in Edinburgh said,

    February 13, 2019 @ 11:46 am

    I think it would mean less without the capitalisation – try it acrostic-style :)

  4. Philip Taylor said,

    February 13, 2019 @ 2:16 pm

    Ah. I see. Now you (all) know why I loathe cryptic crosswords ! But even after acrosticising it, I still get lost at the last part : "Snuffle up agus" ?

  5. Jerry Friedman said,

    February 13, 2019 @ 2:53 pm

    Here's Mr. Snuffleupagus.

  6. Jerry Friedman said,

    February 13, 2019 @ 2:53 pm

    Trying again with matching quotation marks:

    Snuffleupagus

  7. Jen in Edinburgh said,

    February 13, 2019 @ 2:54 pm

    Are you too old for it or too young? :-)

    https://en.m.wikipedia.org/wiki/Mr._Snuffleupagus

  8. Philip Taylor said,

    February 13, 2019 @ 3:10 pm

    Neither, Jen — I have simply never watched Sesame Street, and wasn't even aware of its existence until Mark used one of the characters' names without glossing it, which prompted me to ask to whom he was referring …

  9. Bloix said,

    February 13, 2019 @ 3:44 pm

    Snuffleupagus is literally his last name – hence the Mr. His given name is Aloysius. I'd always assumed he's a Greek-American muppet, but on reflection Aloysius is Latinate, isn't it?

  10. Rodger C said,

    February 14, 2019 @ 7:51 am

    Aloysius is one of those many descendants of Chlodowig (or whatever).

  11. Joshua K. said,

    February 14, 2019 @ 10:49 pm

    I don't understand what ULMFiT alludes to. I know enough about Sesame Street to recognize Elmo, Bert, Ernie, Grover, Kermit, and Snuffleupagus, but I don't see how "Ulmfit" fits in there.
