Anatomy of a spambot
We've often had occasion to wonder how spammy blog comments are linguistically constructed. (See, most recently, Mark Liberman's post, "Numerous upon the written content material," in which he refers to spam comments as "aleatoric sub-poetry.") Now, on Quartz, David Yanofsky and Zachary M. Seward expose how spam comments are engineered:
Comment spam follows a formula, which was made plain the other day when a spambot accidentally posted its entire template on the blog of programmer Scott Hanselman. With his permission, we’ve reproduced some of the spam comment recipes here and added colorful formatting to make it readable. The spambot constructs new, vaguely unique comments by selecting from each set of options. We hope you find it wonderful | terrific | brilliant | amazing | great | excellent | fantastic | outstanding | superb.
A few examples:
I have | I've been surfing | browsing online more than three | 3 | 2 | 4 hours today, yet I never found any interesting article like yours. It's | It is pretty worth enough for me. In my opinion | Personally | In my view, if all webmasters | site owners | website owners | web owners and bloggers made good content as you did, the internet | net | web will be much more | a lot more useful than ever before. I couldn't | could not resist | refrain from commenting. Very well | Perfectly | Well | Exceptionally well written! I will | I'll right away | immediately take hold of | grab | clutch | grasp | seize | snatch your rss | rss feed as I can not | can't in finding | find | to find your email | e-mail subscription link | hyperlink or newsletter | e-newsletter service. Do you have | you've any? Please | Kindly allow | permit | let me realize | recognize | understand | recognise | know so that | in order that I may just | may | could subscribe. Thanks.
Wow, this article | post | piece of writing | paragraph is nice | pleasant | good | fastidious, my sister | younger sister is analyzing such | these | these kinds of things, so | thus | therefore I am going to tell | inform | let know | convey her.
Howdy | Hi there | Hey there | Hi | Hello | Hey! Someone in my Myspace | Facebook group shared this site | website with us so I came to give it a look | look it over | take a look | check it out. I'm definitely enjoying | loving the information. I'm book-marking | bookmarking and will be tweeting this to my followers! Terrific | Wonderful | Great | Fantastic | Outstanding | Exceptional | Superb | Excellent blog and wonderful | terrific | brilliant | amazing | great | excellent | fantastic | outstanding | superb style and design | design and style | design.
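To make the mechanics concrete, here is a minimal sketch of how such a template might be expanded. The brace notation `{a|b|c}`, the function names, and the way the "Wow, this article" example is split into option sets are all assumptions for illustration; the excerpt above does not show the leaked template's actual delimiters, and this is not the spambot's code.

```python
import random
import re

# Hypothetical notation: each {a|b|c} group is one set of interchangeable options.
GROUP = re.compile(r"\{([^{}]*)\}")

def spin(template, rng=None):
    """Return one comment by picking a single option from every group."""
    rng = rng or random.Random()

    def pick(match):
        return rng.choice(match.group(1).split("|")).strip()

    return GROUP.sub(pick, template)

def count_variants(template):
    """Count how many distinct option combinations the template allows."""
    total = 1
    for match in GROUP.finditer(template):
        total *= len(match.group(1).split("|"))
    return total

# Grouping below is a guess at where the option sets begin and end.
TEMPLATE = (
    "Wow, this {article|post|piece of writing|paragraph} is "
    "{nice|pleasant|good|fastidious}, my {sister|younger sister} is analyzing "
    "{such|these|these kinds of} things, {so|thus|therefore} I am going to "
    "{tell|inform|let know|convey} her."
)

if __name__ == "__main__":
    print(spin(TEMPLATE))
    print(count_variants(TEMPLATE), "possible combinations")
```

Under this grouping, one short sentence with six option sets already allows 4 × 4 × 2 × 3 × 3 × 4 = 1,152 combinations, which may be why the comments look "vaguely unique" even though every one of them shares the same skeleton.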
For connoisseurs of such automated quasi-synonymy, let me also note this passage from a piece I wrote for Lapham's Quarterly last year called "Word for Word," a reconsideration of Roget's Thesaurus:
Fans of the television show Friends may recall the episode in which the dim-witted Joey Tribbiani discovers the built-in thesaurus in his word-processing program and tries to spruce up a letter of recommendation for his friends’ adoption agency. He thesaurusizes every word, so that the sentence “They are warm, nice people with big hearts” turns into “They are humid, prepossessing homo sapiens with full-sized aortic pumps.”
That bit of sitcom silliness has actually turned into a grim reality, now that online content farms use so-called spinning software to modify a source text by automatically swapping out words with ostensible synonyms. (The goal is to create new textual fodder that can be used on websites without search engines like Google suspecting that the content has been duplicated from elsewhere.) I recently came across a particularly ham-handed example on a news aggregator which lifted an article from the Star-Ledger about a looming fight between two congressional candidates. The original said that “the Democratic showdown…will be bloody and fairly evenly matched considering the county machinery behind each candidate.” In the “spun” version, the showdown “will be full of blood and sincerely uniformly suited deliberation the county equipment at the back any candidate.” Sadly, this sort of thesaurus-driven gobbledygook can be found in abundance online, as if Joey and his full-sized aortic pump had taken over the Internet.
Victor Mair said,
April 23, 2013 @ 10:52 pm
Despite the millions of such spam messages that are caught by Language Log's filters, we still get a lot of these. I know because I've lately been going through the comments pending section of our dashboard and deleting scores of them. They certainly do follow a pattern similar to that described in the first part of this post.
Sol said,
April 23, 2013 @ 11:24 pm
I've been hearing some interesting stats about the internet lately, like how 1/3 of data traffic during peak hours in North America comes from Netflix. In that spirit, I wonder just how much of "the internet" is generated by bots?
Rahul said,
April 24, 2013 @ 12:34 am
Perhaps the link "on the blog of programmer Scott Hanselman" should point to the permalink of the relevant blog post.
Ian Tindale said,
April 24, 2013 @ 1:23 am
Hmm. This’d be a time-saving mechanism for marking student work.
Sharat B. said,
April 24, 2013 @ 1:26 am
In the spirit of your fascination with spambots, here's a wonderful talk by digital frontiers explorer James Bridle. He does a lot of fun work on new media (always quoting Walter Benjamin and William Gibson).
Matt_M said,
April 24, 2013 @ 1:39 am
I've had ESL students who have tried to "paraphrase" a source in the same way as Joey Tribbiani, with rather similar results.
[(myl) It's not just ESL students — some published authors seem to use such techniques…]
Daniel Tse said,
April 24, 2013 @ 2:26 am
"Do you have any?" -> "Do you've any?" demonstrates a poor understanding of the constraints on contraction…
maidhc said,
April 24, 2013 @ 4:08 am
Sportswriters were always noted for their use of the thesaurus, no doubt because they had to describe the same thing over and over, so they would never say "kicked the ball" when they could say "propelled the spheroid". I don't follow such things, so I don't know if this is still the case.
An augmentation of grandiloquence would be a rocambole to our quotidian discourse.
David Morris said,
April 24, 2013 @ 4:34 am
It's always obvious when ESL students have pulled a translation or synonym from the dictionary or translator.
Toma said,
April 24, 2013 @ 7:24 am
@maidhc:
In sports writing, there is some automated text going on. I have a baseball app that automatically generates game summaries. Sometimes it works ok, but sometimes you get something to the effect of: "Willingham powered the Twins hitting…" And then you look at the box score and he went 1 for 2 with 2 walks. Not exactly power hitting. It seemed to look at his batting average for the game, saw that he got a hit in half of his at-bats and plugged in the word power. So while it doesn't generate the kind of garbage that you see in spam comments, it doesn't always make perfect sense.
(See http://www.mediabistro.com/galleycat/forbes-among-30-clients-using-computer-generated-stories-instead-of-writers_b47243)
Jon Weinberg said,
April 24, 2013 @ 7:28 am
This response is coming nine years too late, but I don't agree with the claim (in the link myl provides attached to Matt_M's comment) that Harvard's first-year writing course trains its students to use big words. Harvard students like to use big words, whether they know their meanings or not (or at least that was so in my day), but the expository writing course isn't why.
Mark P said,
April 24, 2013 @ 8:14 am
Is the obvious clumsiness a result of not being a native English speaker, or just not caring enough to get it close to right?
Ralph Hickok said,
April 24, 2013 @ 8:52 am
@maidhc:
Not just sportswriters. Reporters, especially the young uns, often suffer from elongated yellow fruit syndrome, as it's known in the trade.
@Jon Weinberg:
I agree. If anything, Harvard's expository writing course tried to get students to use smaller words rather than big words. I don't know what your day was, but in my day the emphasis was much more on how to do research and how to use it properly than on the writing itself. (I was in an advanced section of the course because of my English score on the College Board Exams; it might have been different in the lower-level course.)
KevinM said,
April 24, 2013 @ 11:01 am
Not to mention the helpful shopping suggestions. If I look up "Spanish Inquisition," I'm sure to get several days of spam informing me that Target has the lowest prices on Spanish Inquisition. And before someone says it, yes, it's entirely expected.
Mr Punch said,
April 24, 2013 @ 12:51 pm
The expository writing course at Harvard, and similar courses elsewhere, often (depending on the instructor) place some emphasis on precision of expression, which may of course involve replacing vague terms with more exact but less common ones. This is not the same thing as favoring big words.
KWillets said,
April 24, 2013 @ 3:01 pm
I once ran a spam corpus through a suffix tree and found all the maximal repeats (a fairly tricky process in itself). It was similar to the above, with a number of innocuous boilerplate phrases scoring quite high, e.g. "press this button to order". One was "the former dictator of Nigeria".
Spammers focus on defeating word-based filters, but often betray themselves in n-grams.
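A rough way to see the n-gram point, sketched here with plain word n-gram counting rather than the suffix tree described above (a much simpler technique swapped in for illustration). The function name, the parameters, and the three sample messages are made up for the example; only the phrase "press this button to order" comes from the comment.

```python
from collections import Counter

def repeated_ngrams(messages, n=4, min_messages=2):
    """Return word n-grams that appear in at least `min_messages` distinct messages."""
    doc_freq = Counter()
    for text in messages:
        words = text.lower().split()
        # Use a set so each n-gram is counted at most once per message.
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_freq.update(grams)
    return {" ".join(g): c for g, c in doc_freq.items() if c >= min_messages}

# Invented sample messages, for illustration only.
SAMPLE = [
    "Cheap watches here, press this button to order now",
    "Best pills online! press this button to order today",
    "Hello, I enjoyed your post about linguistics",
]

if __name__ == "__main__":
    for phrase, count in sorted(repeated_ngrams(SAMPLE).items()):
        print(count, phrase)
```

The individual words in the shared phrase are harmless on their own, so a word-based filter has little to go on; the repetition only shows up once sequences of words are compared across messages.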
bfwebster said,
April 24, 2013 @ 3:08 pm
And yet, to a human who has encountered a few of these messages, it's almost immediately apparent that such a comment is spam and that it's following some kind of template. The apparent variety doesn't really vary the final text that much.
I ran into this problem some 30 years ago while developing a game — Sundog: Frozen Legacy — for the Apple II. We had a "speech" generator for when you interacted with non-player characters (NPCs), and it looked a lot like the template above: a sentence template with one or more tokens, where each token corresponded to a list of words. When the sentence was to be displayed, each token would be replaced by a random selection from its corresponding word list.
My co-author, Wayne Holder, and I were excited about this, but when we started trying it out, we quickly discovered that the generated text all tended to sound the same. Of course, we were working within extremely limited constraints — 64KB total memory in the computer, and both sides of a 140KB floppy (i.e., 280KB total), and that was for the whole shebang: operating system, graphics library, game, data, everything. Still, it's curious that this spam technique hasn't really improved upon that.
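A toy version of that kind of generator, as a sketch only: the `<TOKEN>` syntax, the word lists, and the sample sentence are invented for illustration and are not taken from Sundog. It also computes how much of every output is fixed text, which is one plausible reason the results sound alike.

```python
import random
import re

# Hypothetical token syntax: <NAME> refers to an entry in WORD_LISTS.
TOKEN = re.compile(r"<(\w+)>")

# Invented word lists and sentence template, for illustration only.
WORD_LISTS = {
    "GREETING": ["Hello", "Greetings", "Welcome"],
    "TITLE": ["traveler", "stranger", "pilot"],
    "GOODS": ["fuel", "cargo", "supplies"],
}
SENTENCE = "<GREETING>, <TITLE>. I can sell you <GOODS> at a fair price."

def render(template, word_lists, rng):
    """Replace each token with a random word from its corresponding list."""
    return TOKEN.sub(lambda m: rng.choice(word_lists[m.group(1)]), template)

def fixed_word_share(template):
    """Fraction of whitespace-separated words that contain no token at all."""
    words = template.split()
    fixed = [w for w in words if not TOKEN.search(w)]
    return len(fixed) / len(words)

if __name__ == "__main__":
    rng = random.Random()
    for _ in range(3):
        print(render(SENTENCE, WORD_LISTS, rng))
    print(f"{fixed_word_share(SENTENCE):.0%} of the words never change")
```

In this example eight of the eleven words are identical in every output, so the 27 possible sentences differ only at three slots; adding more synonyms to each list raises the count without changing the skeleton a reader recognizes.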
kamo said,
April 24, 2013 @ 11:00 pm
Well that's one side of it. Any chance of finding out what things spam filters are looking for?
I'm convinced that just having a compliment in the first sentence is enough. On the depressingly rare occasions a genuine commenter on my blog opens by saying something nice about my writing, you can pretty much guarantee it'll get flagged as spam.
At least I think they're genuine commenters. They must be, right? Never mind Skynet, the robo-apocalypse will be heralded by Turing compliant spambots.
Mark P said,
April 25, 2013 @ 8:00 am
kamo, that reminds me of something one of the commenters here said at her blog (thegreenbelt.blogspot.com) about a spam comment that began with a criticism rather than a compliment.
cs said,
April 25, 2013 @ 8:40 am
Hello, this post is very interesting. I will be sure to read your blog regularly in the future. Also, I'm testing to see if this comment will be flagged as spam.
Jerry Friedman said,
April 25, 2013 @ 9:56 am
David Morris:
It's always obvious when ESL students have pulled a translation or synonym from the dictionary or translator.
I take it "always" is an exaggeration. Surely they sometimes hit on an acceptable word.
By the way, Angus McDiarmid may have been an early adopter of this method.
rvman said,
April 25, 2013 @ 3:42 pm
I ran across a website the other day (styleblazer.com) which automatically substitutes for profanity or just deletes it without replacement, resulting in a bunch of references to Dick Ebersol in an article about Saturday Night Live coming out "package Ebersol". Another article, about authors who disliked the movie versions of their books, featured the movie "Mary Poppins" and contained the sentence:
From the casting of American package Van as a cockney chimney sweep to Julie Andrews’ whimsical performance, James felt her vision was being sanitized.
Matt said,
April 25, 2013 @ 7:34 pm
What, they let "cockney" through? No stick-to-itiveness, that's the problem with autocensors these days.
JG said,
April 27, 2013 @ 8:23 am
What I don't understand about this is what they hope to gain from it. It's obviously spam, but there's no links that I can see, or even the name of a website or company. So where's the benefit to the spammer?
Michael W said,
April 28, 2013 @ 1:47 am
I find the social engineering attempts to be fairly interesting. I wonder if there are intentional mistakes designed to elicit sympathy for a non-native speaker.
This bit
seems to convey the same sentiment that R. Lee Ermey put more succinctly in Full Metal Jacket. ("Hell, I like you. You can come over to my house …")
Brian McHugh said,
June 1, 2013 @ 3:57 pm
This calls to mind the English writing style of the main Ukrainian character in Jonathan Safran Foer's novel *Everything Is Illuminated*.
Scott said,
February 5, 2014 @ 9:05 am
Joey and his full sized aortic pump…nice…that cracks me up. Awesome Friends reference BTW. I've been tracking automated spam/bots on my websites for a while now, since the fairly early days of my time in web design. Back then it was bots scraping guest books for email addresses. (remember those?) Which went into email spam. Now it's blog spam. Spinning content to get links, visibility, etc. Oy.
I'm pretty sure they're gonna get the automated language thing smoothed out before too long, and it's gonna get hard to detect. I'm sure there'll be an app for that. ("Siri, write comments on a blog for me.")