Spam comment of the month
Among the approximately 15,000 spam comments directed at LL over the past 24 hours, this is one of the few that made it past the filters to be dealt with by human moderation:
Ginger ultimately struck North Carolina on September 30 as a chinese culture massive disappointment.
The resulting embryo is afterward transported to tissue may occur, either acutely or chronically, over hundreds of times, sometimes with a little more.
I killed it anyway, of course, but I think it deserves some recognition.
Rube said,
May 28, 2014 @ 8:35 am
I would have believed the first paragraph was a bit of poetry from, say, Richard Brautigan.
Victor Mair said,
May 28, 2014 @ 8:50 am
I kill dozens of these every day too, but few as wild as this example.
One wonders:
a. how they are generated
b. what their purpose is
William Sudry said,
May 28, 2014 @ 8:58 am
The comment, not the embryo. Right?
Ginger Yellow said,
May 28, 2014 @ 9:22 am
Ginger ultimately struck North Carolina on September 30 as a chinese culture massive disappointment.
I demand an apology!
leoboiko said,
May 28, 2014 @ 9:38 am
@Mair:
a. Markov chains, perhaps?
b. Probably to get the spammer's IP/email through moderation once. The default anti-spam behavior of much blogging software (such as WordPress) is to hold all new comments for moderation, but to let them straight in if the author has already been approved once.
Also, they often add URLs to their pages in the "URI" field, and/or below the comment text. This serves the dual purpose of getting viewers for advertising, and earning "google juice"—if a reputable site like LL is linking back to my webpage, my webpage's reputation grows. Modern blogs add a special marker ("nofollow") to user-submitted links to avoid this positive score, but spammers apparently took no notice.
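To make the Markov-chain guess in (a) concrete, here is a minimal sketch of a word-level Markov text generator. Nothing in this thread confirms that the spammers actually work this way; the function names, the `order` parameter, and the toy corpus are all made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, length=20, seed=None):
    """Random-walk the chain, stopping early at a state with no successor."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # pick a starting state
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Invented toy corpus (an assumption, not the spammers' real sources).
corpus = (
    "the storm struck the coast on Monday and the storm caused damage "
    "on the coast and the damage may occur over time on the coast"
)
chain = build_chain(corpus, order=1)
print(generate(chain, length=15, seed=1))
```

Every short run of words in the output occurs somewhere in the source text, which is why such output can look locally grammatical while making no sense overall.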
Stan Carey said,
May 28, 2014 @ 9:40 am
Some of my spam has been weirdly cinematic recently, e.g. this one about unleashing monsters and blowing the world up.
Victor Mair said,
May 28, 2014 @ 10:07 am
A lot of them border on being outright ads. I just nixed this one:
=====
Poulan Weed Eater 16 25cc Feather – Lite – Xtreme Gas Line Trimmer.
): liberally added to mixes to help fight viral infections.
the older garden weeds together will be eliminated in their infancy.
=====
Christian W. said,
May 28, 2014 @ 10:15 am
I've tried to figure out why certain passages are grabbed out of certain articles, to see if a pattern emerges.
"[M]ay occur, either acutely or chronically over" is from a bodybuilding article on acupuncture, and "resulting embryo is afterward transported" is from an article on in vitro acupuncture, but "Ginger ultimately struck North Carolina on September 30" is from the Wikipedia entry on Hurricane Ginger.
I'm dead curious to learn the philosophy behind the algorithm that both seeks these phrases out in disparate sources and decides they follow minimal grammatical rules to fool a reader.
AG said,
May 28, 2014 @ 10:16 am
Sit on this until they make another Aliens Vs. Predator movie, then sue those idea-stealing bastards for all they're worth!
Victor Mair said,
May 28, 2014 @ 10:35 am
The most common type of spam comments that I have to reject are those that just blandly praise us for having a nice blog.
Ellen K. said,
May 28, 2014 @ 12:33 pm
I find it interesting that the first sentence (paragraph) is actually, unexpectedly, grammatical, if you pay close enough attention (to get past the fact that it doesn't make sense).
Zubon said,
May 28, 2014 @ 1:29 pm
"The most common type of spam comments that I have to reject are those that just blandly praise us for having a nice blog."
Which makes it difficult to post a legitimate "good post" comment. Being a blogger would be a happier life choice if we received more positive feedback, the equivalent of Facebook likes and Google +1s, but short praise is usually a sign of spamming.
wally said,
May 28, 2014 @ 2:05 pm
15,000 spam comments over the past 24 hours
Do we have any idea how many (if any?) of those were wrongfully rejected?
[(myl) That might be a low estimate — there have been 4,574 comments caught by the spam filter in the past 105 minutes, which would translate to 62,729 per 24 hours.
I don't know how many of those might have been wrongly trapped, because there are far too many for me to check them manually, as I used to do when there were only a few hundred a day.
I get about one letter a month, on average, from someone whose comment got stuck in the spam filter, so that's a lower bound. I don't know how many people just give up.]
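The extrapolation in the bracketed note is simple proportional scaling, and can be checked with a few lines of Python. The function name is hypothetical; the two input figures are the ones quoted above.

```python
MINUTES_PER_DAY = 24 * 60


def extrapolate_daily(count, window_minutes):
    """Scale a count observed over `window_minutes` to a 24-hour total."""
    return count * MINUTES_PER_DAY / window_minutes


# 4,574 comments caught in 105 minutes, as reported above.
daily = extrapolate_daily(4574, 105)
print(round(daily))  # 62729
```

This assumes the spam arrives at a steady rate all day, which the note does not claim; a burst during those 105 minutes would inflate the estimate.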
Tim Elfenbein said,
May 28, 2014 @ 3:17 pm
Some of you might be interested in Paul Kockelman's take on spam filters and other types of sieves: "The Anthropology of an Equation: Sieves, Spam Filters, Agentive Algorithms, and Ontologies of Transformation." http://www.haujournal.org/index.php/hau/article/view/hau3.3.003
AntC said,
May 28, 2014 @ 7:23 pm
@Victor outright ads.
Mark, was that spam actually trying to sell anything? Was there a clickable link? To what?
Why would someone (build a bot to) send spam for no purpose?
@Victor, @Zubon blandly praise us.
I'll have to make sure that my praise is not bland — or is Victor saying that the best form of praise is criticism? (Compare flattery/imitation ;-)
But same q re the bot-generated praise: why bother?
gacorley said,
May 28, 2014 @ 7:50 pm
@AntC: Just for reference, an example of the "bland praise" type (from my own site):
Very nice post. I just stumbled upon your weblog and wished to say that I have really enjoyed browsing your blog posts.
After all I’ll be subscribing to your rss feed
and I hope you write again very soon!
In my case, these are very easy to spot, since the site that received this comment is a podcast, but they're probably pretty obvious anyway. It's likely that this comment was written just once and then sent out automatically to hundreds of thousands of blogs, so a simple Google search will reveal it for what it is if you really have any doubt.
Milan said,
May 28, 2014 @ 7:57 pm
@AntC
From my experience, there are seldom links in the comment itself, but the URI connected with the name usually leads to some very dubious website, if not outright malware. I'm not always sure whether the spammers actually hope to lure people into visiting those sites, or whether they just want to raise their Google rank by having more links leading to them.
Dan Lufkin said,
May 28, 2014 @ 8:20 pm
@ Stan Carey — I read just recently that a site that reviews Netflix movies has a bug that replaces the last sentence of every review with the last sentence of some other review. This looks like an instance of that.
I haven't had time to try to track it down any further.
Victor Mair said,
May 28, 2014 @ 8:21 pm
@MYL: "I get about one letter a month, on average, from someone whose comment got stuck in the spam filter, so that's a lower bound. I don't know how many people just give up."
I also sometimes hear from individuals who can't get their legitimate comments through to us. Occasionally (rarely), this might be due to the spam filters, but it also happens for other reasons. For example, if you use an arrowhead (e.g., for derivation), your comment will be wholly or partly rejected, so use "from" or something like that to replace it.
The main thing I want to say here is that, if your well-intentioned comment is repeatedly rejected for any reason, don't hesitate to write to us to ask us if we know the reason why. In the worst case we'll put it up for you.
Victor Mair said,
May 28, 2014 @ 8:34 pm
@AntC
They aren't really outright ads, but, as I said, they "border on being outright ads". Seldom do they solicit business or tell someone how to place an order, but they do mention — often in a rather vague way — products and services.
As for "bland praise", I meant "vapid, empty praise", in the sense that they never engage directly with any of our posts but simply state that ours is a good blog, that they are going to come back, that they are going to introduce us to their friends, or that they heard about us from their friends, etc. Sometimes they even pretend to be chummy by telling us which of their friends or relatives they heard about us from, but they never refer to the content of any of our posts. In fact, the way I weed out most of the offending spam is by looking at which posts they are replying to, and there are two obvious give-aways for spam:
1. the post is old
2. what they say in their message is totally unrelated to the title of the old post
Now, sometimes, for various reasons, people will write good comments about posts that are a year old or even much older than that, but you can always spot such a comment immediately as legitimate because its content matches the title of the post to which it is replying. In such cases, I immediately hit "Approve"!
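The two give-aways above (an old post, and a comment unrelated to its title) could be sketched as a rough automatic check. This is only an illustration of the manual rule of thumb, not a description of how LL's filter works; the one-year cutoff, the stopword list, and the function names are all assumptions.

```python
import re

# Hypothetical cutoff: the thread does not say how old "old" is.
OLD_POST_DAYS = 365

# Small, made-up stopword list so that filler words do not count as a match.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "vs", "for", "is"}


def content_words(text):
    """Lowercased alphabetic words from `text`, minus the stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def looks_like_spam(comment, post_title, post_age_days):
    """Flag a comment on an old post that shares no content word with the title."""
    is_old = post_age_days > OLD_POST_DAYS
    unrelated = not (content_words(comment) & content_words(post_title))
    return is_old and unrelated


title = '"Can cause" vs. "may cause"'
print(looks_like_spam("Very nice post. I just stumbled upon your weblog.", title, 1100))  # True
print(looks_like_spam("I think 'may cause' hedges more than 'can cause'.", title, 1100))  # False
```

Plain word overlap is a crude stand-in for "related to the title": a spam comment that happens to contain one title word would slip through, which is one reason a human check remains useful.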
Douglas Bagnall said,
May 29, 2014 @ 4:22 am
@leoboiko:
I believe recurrent neural networks are the state of the art when it comes to generating plausible nonsense (or equivalently, modelling written language).
Marek said,
May 29, 2014 @ 8:04 am
Wikipedia's got you covered (though not in great detail):
http://en.wikipedia.org/wiki/Article_spinning
It's quite a big thing. If you take a look around, you'll find ridiculously priced commercial software for this kind of stuff, not to mention hundreds of ad-powered 'free' spinners available online.
It's actually confused linguists in the past – I've seen examples taken from the Internet which were obvious word-substitution spam cited as arguments for grammaticality of certain constructions in articles and books.
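For readers who have not met article spinning, here is a toy sketch of the word-substitution idea. The synonym table and function name are invented for illustration; commercial spinners presumably use far larger tables and more elaborate rules.

```python
import random

# Hypothetical synonym table; each word maps to its interchangeable variants.
SYNONYMS = {
    "nice": ["nice", "great", "lovely"],
    "post": ["post", "article", "write-up"],
    "enjoyed": ["enjoyed", "liked", "relished"],
}


def spin(text, synonyms, seed=None):
    """Replace each word that has an entry in `synonyms` with a random variant."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(synonyms.get(word.lower(), [word])) for word in text.split()
    )


print(spin("very nice post I really enjoyed browsing your blog", SYNONYMS, seed=3))
```

Because substitution ignores context and collocation, the output keeps the original sentence frame while drifting toward statistically odd word choices, which fits the confusion described above.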
Emily said,
May 29, 2014 @ 2:10 pm
@Marek: I'd be interested to see some examples of linguists that were thus misled, and the sentences in question.
Stan Carey said,
May 29, 2014 @ 3:33 pm
@ Dan Lufkin: Thanks; I didn't know about that. Could be connected.
Martin said,
May 29, 2014 @ 4:15 pm
Oh come on, I think you have written enough about Paul de Mans already.
Sybil said,
May 29, 2014 @ 5:36 pm
@Martin: I actually Googled "Paul de Mans" and now feel like an idiot. Because I still don't get it. (Follow me LL readers if you will!) (I feel enough that way given I'm doing final grades.) What is this comment meant to evoke? And I'm woman enough to admit I don't get it. And I resent the time I spent trying to figure out if you were actually saying something.
Come to think of it, that's what is at the bottom of all or a lot of our resentment of this type of spam. (And, I add sadly, a lot of academic burnout.) We start out assuming good faith: this person actually means to say something, however ineptly. We try to figure out what was meant. And many hours later, we reluctantly conclude that the writer was just "writing stuff" (as one of my students described it), hoping that if enough "stuff" was flung that some of it would stick.
I feel most sorry for those of my students who really and truly try to say something, however much their attempts may fall short. Bravo, you guys!
Sybil said,
May 29, 2014 @ 5:44 pm
I feel obliged to add that it's really extraordinary the number, depth, and breadth of the comments on this post. Do we so miss ML? (Of course!) Or do we miss the whole of LL? (Which again, of course.)
You've created a monster!
Brett said,
May 29, 2014 @ 7:06 pm
@Sybil: There was a recent discussion on Language Log of whether Paul de Man's writing really contained any information.
http://languagelog.ldc.upenn.edu/nll/?p=11870
http://languagelog.ldc.upenn.edu/nll/?p=11913
Sybil said,
May 29, 2014 @ 8:44 pm
@Brett: thanks for some actual information. How did I miss that discussion? I can't tell, but I often forget stuff that isn't of immediate import.
Ken said,
May 29, 2014 @ 9:50 pm
It's no "Colorless green ideas sleep furiously," but it has its points.
Marek said,
May 30, 2014 @ 2:58 am
@Emily:
One example which caught my attention comes from a chapter on language in Shimon Edelman's (2008) book 'Computing the Mind' in which the author argues against Chomsky's strong judgements about the ungrammaticality of 'obligatory to follow' and in favor of graded ones.
I quote:
" This account implies that sentences should possess a GRADATION OF NATURALNESS, and indeed they do. Most prominently, making unlikely choices while generating sentences from a set of constructions leads to language that sounds vaguely strange or non-native, as if the speaker had a wordchoice “accent” while producing perfectly grammatical sentences. As an example, consider this passage from an e-mail message that urged me to go to a fake website to “update” my social security number and bank account details: “This instruction has been sent to all Smith Barney customers and is obligatory to follow.” A Google search for the phrase /obligatory to follow/ yielded 580 results (compared to 3,660,000 for /must be followed/). […]
Another phrase from the same message that looked suspect to me, /earnestly ask/, yielded 30,600 hits; interestingly, these too were dominated by phishing alerts and religious admonitions […] Were I in fact a SmithBarney customer, the two statistically sound choices for me at this point would have been […] (2) to infer that the message had been generated by a speaker of statistically odd English with less than honorable intentions. "
So it seems that while Edelman was aware of the top results being scams, he mistook a text spinner for 'a speaker of statistically odd English'.
Victor Mair said,
May 30, 2014 @ 6:55 am
Here's one that came in from a source that calls itself "government real estate auctions" and has this eddress:
zanyguy47399461.pen.io/x
ferdinand.lowerson@yahoo.de
212.96.68.36
=====
Some points may also help boost the immune system.
The most common in people with osteoarthritis of the people involved is,
in adult postoperative and chemotherapy can be beneficial in depression: a randomized controlled study in Germany,
and another. Some even use facial acupuncture but without
any form of treatment that is secreted by the pressure, most acupuncture treatments,
the factors. However, the midway point of view.
Have you tried acupuncture. Therafter they need a wheelchair, Chloe
had a stroke.
=====
Are they interested in pushing acupuncture? Or government real estate auctions? Both? Neither?
BTW, it was posing as a comment to this post: "Can cause" vs. "may cause" (May 17, 2011).
http://languagelog.ldc.upenn.edu/nll/?p=3146
If I had approved it, rather than killing it, this spam would have been added to the thread. Naturally, I killed it.
Victor Mair said,
May 30, 2014 @ 7:10 am
Most (nearly all) of the spam I nix is directed at old Language Log posts, as I've explained in earlier comments to this thread. Now, just as we are discussing spam comments in the present post, the following three specimens sneaked through our filters at 2:37 a.m., 2:41 a.m., and 2:42 a.m. this morning:
=====
1.
Ginger ultimately struck North Carolina on September 30 as a chinese culture massive disappointment.
I demand an apology!
From:
tiger airways
0 approved
tigerairvietnam.com/x
vietnambookingcon@gmail.com
113.172.127.11
2.
I feel obliged to add that it's really extraordinary the number, depth, and breadth of the comments on this post. Do we so miss ML? (Of course!) Or do we miss the whole of LL? (Which again, of course.)
From:
Eva Air
0 approved
evaair.vnx
vietquan231092@gmail.com
113.172.127.11
3.
I feel obliged to add that it's really extraordinary the number, depth, and breadth of the comments on this post. Do we so miss ML? (Of course!) Or do we miss the whole of LL? (Which again, of course.)
From:
Tiger Airways
0 approved
vanphongtigerairways.comx
sonluudan1@yahoo.com.vn
113.172.127.11
=====
I've killed thousands of spam comments, but this seems to be a new development (at least I haven't noticed it until the present discussion), namely, spam that is generated based on a currently active post.
It's not hard to guess what these three are trying to push.
Of course, I have removed all three of them.
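Since all three specimens are verbatim copies of text already in this thread, one conceivable check is to compare each incoming comment against the earlier ones. The sketch below is an assumption about how that might be done, not a feature of any actual filter; the function names and the length threshold are hypothetical.

```python
# Hypothetical minimum length, so short replies like "Thanks!" are never flagged.
MIN_CHARS = 40


def normalize(text):
    """Lowercase the text and collapse all runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def copies_earlier_comment(new_comment, earlier_comments, min_chars=MIN_CHARS):
    """True if the whole new comment appears verbatim inside an earlier comment."""
    candidate = normalize(new_comment)
    if len(candidate) < min_chars:
        return False
    return any(candidate in normalize(earlier) for earlier in earlier_comments)


earlier = [
    "Ginger ultimately struck North Carolina on September 30 "
    "as a chinese culture massive disappointment.\nI demand an apology!"
]
spam = (
    "Ginger ultimately struck North Carolina on September 30 "
    "as a chinese culture massive disappointment.\nI demand an apology!"
)
print(copies_earlier_comment(spam, earlier))  # True
```

A legitimate reply that quotes an earlier comment and then adds its own text is not a substring of that earlier comment, so it would not be flagged by this check.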