"The victims are large and costly machines"


From Shay Cohen, via Lyle Ungar, some output from a PCFG ("probabilistic context-free grammar") trained on the Wall Street Journal part of the Penn Treebank:

A manager is a better value of well-polished desks .

I have been able to force to be more receptive to therapy , and to keep the committee informed , usually in advance , of covert actions : ; the victims are large and costly machines .

The purchase of all women is in September .

Their museum had been dumping their securities for comment .

It can remember one million truly inspiring teachers from Rainbow Technologies .
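
For readers who have not met a PCFG before, here is a minimal sketch of how sentences can be sampled from one. It is not the grammar behind the samples above: the rules, probabilities, and the names `TOY_PCFG`, `sample`, and `generate_sentence` are all invented for illustration, whereas a real system would estimate rule probabilities from the treebank parses.

```python
import random

# Toy grammar: each nonterminal maps to (right-hand side, probability) pairs.
# Rules and probabilities are invented for illustration, not treebank estimates.
TOY_PCFG = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("Det", "N"), 0.7), (("Det", "Adj", "N"), 0.3)],
    "VP": [(("V", "NP"), 0.6), (("V",), 0.4)],
    "Det": [(("the",), 0.6), (("a",), 0.4)],
    "Adj": [(("large",), 0.5), (("costly",), 0.5)],
    "N": [(("manager",), 0.4), (("machine",), 0.3), (("desk",), 0.3)],
    "V": [(("is",), 0.5), (("remembers",), 0.5)],
}


def sample(symbol, grammar, rng):
    """Expand one symbol top-down and return the resulting list of words."""
    if symbol not in grammar:
        # Anything without a rule is treated as a terminal (a word).
        return [symbol]
    expansions = [rhs for rhs, _ in grammar[symbol]]
    weights = [p for _, p in grammar[symbol]]
    # Pick one right-hand side in proportion to its probability.
    rhs = rng.choices(expansions, weights=weights, k=1)[0]
    words = []
    for child in rhs:
        words.extend(sample(child, grammar, rng))
    return words


def generate_sentence(grammar, seed=None):
    """Sample a sentence from the start symbol S, tokenized treebank-style."""
    rng = random.Random(seed)
    return " ".join(sample("S", grammar, rng)) + " ."


print(generate_sentence(TOY_PCFG, seed=0))
```

Each rule is chosen independently of everything outside its own subtree, which is why the output of such a model tends to be locally grammatical but globally meaningless, as in the five sentences above.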

This reminds me of a rash of relatively high-quality spam comments that came our way a few weeks ago. A small sample:

People point out "Ultimate Battling, " I say "Professional Snuggling! "

I am grumpier over a 3 calendar year previous ballerina which has a wedgie.


Just what will it mean each time a woman calls an individual "scrumptious"… very good? bad?

I'm going to live-tweet this specific fart. Err. Succulent. Acidic. And…. completed.


  1. Henning Makholm said,

    March 6, 2012 @ 2:13 pm

    The spammers are probably using regular old word-based Markov chain generators. The PCFG output seems to have slightly better sentence-level coherence, but there's no need for a spammer to employ such advanced methods as long as spamfilters don't attempt to do whole-sentence analysis anyway.
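
A word-based Markov chain generator of the kind mentioned in this comment might look roughly like the sketch below. The corpus is made up for the example, and `build_chain` and `generate` are hypothetical names; nothing here is taken from any actual spam tool. With `order=2` the same code behaves as a trigram generator.

```python
import random
from collections import defaultdict

# Tiny made-up corpus, used only to exercise the code.
CORPUS = [
    "the victims are large machines",
    "the machines are costly",
    "the purchase is in september",
]


def build_chain(sentences, order=1):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        # Pad with start markers and close with an end marker.
        words = ["<s>"] * order + sentence.split() + ["</s>"]
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            chain[state].append(words[i + order])
    return chain


def generate(chain, order=1, max_words=30, seed=None):
    """Walk the chain from the start state until the end marker or max_words."""
    rng = random.Random(seed)
    state = ("<s>",) * order
    out = []
    while len(out) < max_words:
        # Repeated successors appear repeatedly in the list, so a uniform
        # choice samples in proportion to the observed counts.
        nxt = rng.choice(chain[state])
        if nxt == "</s>":
            break
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out)


chain = build_chain(CORPUS, order=1)
print(generate(chain, order=1, seed=0))
```

The model only ever looks back `order` words, so it has no notion of phrase structure. That is one plausible reason the PCFG output holds together a little better at the sentence level.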

  2. Theo Vosse said,

    March 6, 2012 @ 2:28 pm

    I've built my fair share of generators. This is output from one that was hand-crafted on titles of new age songs:
    No footprints on the blue sky
    Clouds and wings
    Rhythm of the eternal forest
    Moving movements
    and my personal favorite: Dawn of the bamboo day

    And these are just three random examples of output from a trigram generator. Can you guess which corpus was used as input?
    Neighborhood Size Effects of Grammatical Gender: a PDP Approach
    Effects of Chinese Words in French
    Finiteness and the Computation of Agreement

    And now that I'm on it, this is from a Haiku generator:
    Een uitgerekend grijs zoekt
    Gradaties stilstaan

    In English it's probably not a haiku:
    A computed gray searches
    Levels of stand still

  3. Sili said,

    March 6, 2012 @ 3:15 pm


    That looks like a genuine "Ask LanguageLog" question.

  4. Jens Ayton said,

    March 6, 2012 @ 4:42 pm

    I’ve been enjoying a Markov bot called @RandomTEDTalks on Twitter. Examples include “The Hunt For A Future That Never Happened”, “The Hunt For A Future That Never Happened”, “Hooked By An Octopus”, and “Charles Leadbeater Weaves A Tight Argument That Isn't Just Tedious, It's Irrelevant To Real Mathematics And The Clues To Past Civilizations”. Uncanny!

    (Actually, one of those was a real TED talk.)

  5. Jens Ayton said,

    March 6, 2012 @ 4:43 pm

    And one was a real copy & paste error!

  6. David Walker said,

    March 6, 2012 @ 6:48 pm

    Theo, I like that one:

    A computed gray searches
    Levels of stand still

    It sounds haiku-ish to me, even if it doesn't have the exact number of syllables. I forget the definition.

  7. George Amis said,

    March 6, 2012 @ 7:30 pm

    @David Walker

    Haiku have three lines, / seventeen syllables, five / seven five, like this.

  8. Rod Johnson said,

    March 6, 2012 @ 9:13 pm

    Many, many songs have been written from titles generated here. Examples:

    feelin' back the antihero
    evil (pacem)
    for my vomit
    heaven for the cracks
    don't bow down

  9. Sparky said,

    March 6, 2012 @ 11:53 pm

    The fart one was a human.

    Or else a winner of the Turing test.

  10. Andy Averill said,

    March 7, 2012 @ 12:25 am

    In other news, colorless green ideas actually do sleep furiously.

  11. Alex Boulton said,

    March 7, 2012 @ 8:34 am

    Reminiscent of the Postmodernism Generator (cf. the Sokal hoax): create your own instantly publishable paper, as meaningful as many… http://www.elsewhere.org/pomo/

  12. Mr Punch said,

    March 7, 2012 @ 12:05 pm

    You can't expect much in the way of coherence if you train on the WSJ editorial pages.

  13. MBM said,

    March 7, 2012 @ 1:06 pm

    Humorous text generators and lousy machine translation, those are the achievements of computational linguistics.

  14. Toma said,

    March 7, 2012 @ 1:16 pm

    In the 1980s, I had a BASIC program that generated poetry. It randomly chose words and plugged them into a pattern like "adjective noun present tense verb adverb" and so on. They usually turned out pretty funny in a meaningless sort of way. Sounds like these spammers have only a slightly more advanced version of this.
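
The slot-filling approach described in this comment is easy to sketch. The version below is in Python rather than BASIC, and the word lists, the pattern, and the name `make_line` are assumptions made for the example, not a reconstruction of the original program.

```python
import random

# Hypothetical word lists, one per part-of-speech slot.
WORDS = {
    "adjective": ["colorless", "grumpy", "eternal", "costly"],
    "noun": ["ideas", "machines", "forests", "desks"],
    "verb": ["sleep", "wander", "dissolve", "sing"],
    "adverb": ["furiously", "softly", "twice", "backwards"],
}

# The pattern quoted in the comment: adjective noun present-tense-verb adverb.
PATTERN = ["adjective", "noun", "verb", "adverb"]


def make_line(pattern, words, rng):
    """Fill each slot with a word drawn uniformly from its list."""
    return " ".join(rng.choice(words[slot]) for slot in pattern)


rng = random.Random(0)
for _ in range(3):
    print(make_line(PATTERN, WORDS, rng))
```

Unlike a Markov chain or a PCFG, nothing here is learned from a corpus: the template fixes the grammar and the lists fix the vocabulary.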

  15. patricia said,

    March 7, 2012 @ 2:43 pm

    these remind me of the nonsense strings in those candidate "Bad Lip-Reading" videos, such as http://youtu.be/BhDhDRvHaGs

  16. cxpli said,

    March 7, 2012 @ 2:47 pm

    I'd guess those spam comments are actually random tweets gathered from Twitter with certain words replaced with synonyms. They make too much sense to have been computer-generated, and with a bit of word substitution you can guess at the original, e.g.:

    I am grumpier over a 3 calendar year previous ballerina which has a wedgie.
    = I am grumpier than a 3 year old ballerina with a wedgie.


  17. Dave M. said,

    March 7, 2012 @ 3:29 pm

    @cxpli and @Sparky:

    Yup, at least some of the spam comments are from Twitter. For example, the fart one is derived from a tweet by Rainn Wilson:

    I'm going to live-tweet this fart. Hmmm. Moist. Acidic. And…. done.


    Obvious find and replace:
    "this" — "this specific"
    "Hmmm." — "Err."
    "Moist" — "Succulent"
    "done" — "completed"
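
The substitution step listed above can be sketched in a few lines. The table holds exactly the four pairs from this comment. The function name `spin` and the single-pass regex approach are assumptions, and how the actual spam software works is not known.

```python
import re

# The four replacements identified in the comment above.
SUBSTITUTIONS = {
    "this": "this specific",
    "Hmmm": "Err",
    "Moist": "Succulent",
    "done": "completed",
}


def spin(text, substitutions):
    """Replace whole words in a single pass, so inserted text is never re-matched."""
    alternatives = "|".join(re.escape(word) for word in substitutions)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda m: substitutions[m.group(1)], text)


original = "I'm going to live-tweet this fart. Hmmm. Moist. Acidic. And…. done."
print(spin(original, SUBSTITUTIONS))
```

Applied to the original tweet, these four substitutions yield the spam comment quoted in the post. Doing the replacement in one pass matters: a naive sequence of string replacements could rewrite the "this" inside "this specific" a second time.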

  18. David Eddyshaw said,

    March 7, 2012 @ 6:23 pm

    "A manager is a better value of well-polished desks."

    Alas, this is only true of the elite. Few managers can truly be said to attain this level.
