Language Log

Stylometric analysis of the Sony Hacking

January 10, 2015 @ 10:59 pm · Filed by Victor Mair under Language and computers, Language and the media, Language and the movies, Linguistics in the news

The question of who was behind the hacking of Sony peaked a couple of weeks ago, but it is still a live issue. The United States government insists that it was the North Koreans who did it:

"Chief Says FBI Has No Doubt That North Korea Attacked Sony" (New York Times — January 8, 2015)‎

James B. Comey, director of the Federal Bureau of Investigation, said on Wednesday that no one should doubt that the North Korean government was behind the destructive attack on Sony’s computer network last fall.

Others provide evidence that North Korean hackers may have carried out the attacks with Chinese assistance:

"Did China Help North Korea Hack Sony?" (Forbes — 12/21/14)

The evidence suggests Beijing had to have been aware of North Korea’s hacking of Sony as soon as it began and was undoubtedly complicit in that crime.

Why? An intelligence official, speaking anonymously to Fox News this week, stated the “final stage of the attack” was launched outside North Korea. Ars Technica reports that the attacks originated from Chinese IP addresses.

"Sony hack: China may have helped North Korea, US states" (The Telegraph — 12/19/14)

China may have helped North Korea carry out the hacking attack on Sony Pictures, a US official has told Reuters.

The official, who spoke on condition of anonymity, said the conclusion of the US investigation was to be announced later by federal authorities.

"North Korean defector: 'Bureau 121' hackers operating in China" (CNN — January 8, 2015)

On the streets of the neon-lit Chinese city of Shenyang, you'll find a restaurant, hotel, and other businesses owned and operated by the North Korean government.

Other sources extend the claims of possible complicity beyond China to Iran and Russia:

"Evidence in Sony hack attack suggests possible involvement by Iran, China or Russia, intel source says" (Fox –12/19/14)

Earlier Thursday, Fox News confirmed that the FBI is pointing a digital finger at North Korea for the attack.

The source pointed to the sophistication of malware “modules or packets” that destroyed the Sony systems — on a level that has not been seen from North Korea in the past — but has been seen from Iran, China and Russia.

In "Sony hacker language" (12/21/14), we raised the possibility that linguistic analysis of their communications might reveal who the hackers were. The post and its comments were inconclusive about the accuracy of identification based on the pattern of mistakes in the English used by the hackers.

More recently, there have been assertions that linguistic evidence in the Sony hackers' phraseology indicates they are Russian rather than Korean:

"Taia Global Linguists Establish Nationality of Sony Hackers as Likely Russian, not Korean"
(N.B.: The word "likely" was added to the title on January 8, 2015.)

Executive summary (12/24/14):

Brief summary on boingboing:

Shlomo Argamon, Taia Global’s chief scientist, said in an interview Wednesday that the research was not a quantitative, computer analysis. Mr. Argamon said he and a team of linguists had mined hackers’ messages for phrases that are not normally used in English and found 20 in total. Korean, Mandarin, Russian and German linguists then conducted literal word-for-word translations of those phrases in each language. Of the 20, 15 appeared to be literal Russian translations, nine were Korean and none matched Mandarin or German phrases.

Mr. Argamon’s team performed a second test of cases where hackers used incorrect English grammar. They asked the same linguists if five of those constructions were valid in their own language. Three of the constructions were consistent with Russian; only one was a valid Korean construction. “Korea is still a possibility, but it’s much less likely than Russia,” Mr. Argamon said of his findings.
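To get a rough feel for how much weight counts like these can bear, here is a minimal Python sketch (an illustration only, not part of the Taia Global analysis). It puts approximate 95% Wilson score intervals around the reported match rates of 15 out of 20 for Russian and 9 out of 20 for Korean. The function name `wilson_interval` and the choice of interval are assumptions made for illustration. The sketch also treats the two counts as if they were separate samples, although the same 20 phrases were checked against every language, so it should be read only as an indication of how wide the uncertainty is with 20 items.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half_width, centre + half_width

# Match counts out of 20 phrases, as reported in the summary quoted above.
matches = {"Russian": 15, "Korean": 9, "Mandarin": 0, "German": 0}

for language, count in matches.items():
    low, high = wilson_interval(count, 20)
    print(f"{language:8s} {count:2d}/20  interval {low:.2f} to {high:.2f}")
```

With only 20 phrases, the intervals for Russian and Korean are wide and overlap, which fits the more cautious reading that Korean remains a possibility.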

Fuller summary on Digital Dao:

"Linguistic Analysis Proves Sony's Hackers Most Likely Russian, Not Korean" (12/26/14)

Taia Global's Chief Science Advisor Dr. Shlomo Argamon, one of the country's preeminent researchers in authorship analysis and stylometry, led a team that conducted native language identification (NLI) analysis on the 20 messages left by Sony's hackers. Their results do not support the U.S. government's charge that North Korea was responsible for the network attack against Sony Pictures Entertainment. This post is a mini-version of the full report, which can be downloaded from the Taia Global website.

I wrote to Taia Global and they swiftly sent me their white paper, which I read with great interest. I cannot copy passages here because that would not be ethical, but I can describe the contents of the white paper. Before doing so, however, I wish to state that, when I initially heard about the Taia Global report, I was skeptical they could accomplish what they claimed, namely, a tentative identification of the Sony hackers on the basis of linguistic features. After carefully reading through the white paper, I am more inclined to agree that the fractured English of the hacker messages somewhat resembles Russian, Korean, Mandarin, and German in that descending order.

Incidentally, also on Digital Dao is this relevant, recent article by Jeffrey Carr, President and CEO of Taia Global, which once again casts doubt on North Korea as being the sole source of the Sony hacking:

"FBI Director Comey's Single Point Of Failure on Sony" (1/7/15)

It simply isn't enough for the FBI director to say "We know who hacked Sony. It was the North Koreans" in a protected environment where no questions were permitted (I never allow that at Suits and Spooks events). The necessity of proof always lies with the person who lays the charges. As of today, the U.S. government is in the uniquely embarrassing position of being tricked by a hacker crew into charging another foreign government with a crime it didn't commit. I predict that these hackers, and others, will escalate their attacks until the U.S. figures out what it's doing wrong in incident attribution and fixes it.

The title of the Taia Global white paper is "Native Language Identification (NLI) Establishes Nationality of Sony’s Hackers as Likely Russian".

Contents

Executive Summary (4-5)

The Problem (5)

The Data (5-6)

Assumptions and Caveats (6-7)

Methodology (7-8)

Ruling Out Candidate Languages (7-8)

Similarity to Korean L2 Language (8)
Relying on The Gachon Learner Corpus 2.1

Results: Candidate Languages (9)
(Korean, Mandarin Chinese, Russian, German)

Results: Korean L2 English (10)
(17 specific errors and 4 expected errors that are absent from the hackers' messages)

Conclusions (11)

Appendix A: The GOP ["Guardians of Peace"] Messages (12-19)

20 messages. Reading through all of these puts a completely different light on the language question from what I had gleaned after being exposed only to what was available in the major media. Since there is a much larger corpus here, you can start to see recurring errors and patterns that were not evident in the few items that were available in the press.

Appendix B: English Errors Identified in the Messages (20-21)

Lexical Errors and Infelicities (L) (20)
(contrasts original phrases in the messages with their intended meanings [from context])

Syntactic Errors and Infelicities (S) (21)

Appendix C: Languages Matching L & S Errors (22-23)

L Errors (22)

S Errors (23)

Appendix D: Google Translations of Native Language Texts (24-25)

Sample original message from the hackers — second half of message 17 in Appendix A with translation into Russian and then back to English from Russian and into Korean and then back to English from Korean. Naturally, neither the back translation from Russian nor the back translation from Korean corresponds exactly to the original GOP English message, but the Russian is much closer.
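One simple way to put a number on "closer" in a comparison of this kind is a word-level similarity ratio. The sketch below is an illustration only and assumes nothing about how the white paper made its comparison. It uses Python's standard `difflib.SequenceMatcher`; the helper name `closeness` is hypothetical, and the three sentences are invented placeholders, not text from the GOP messages or from any actual translation.

```python
from difflib import SequenceMatcher

def closeness(original: str, back_translation: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    a = original.lower().split()
    b = back_translation.lower().split()
    return SequenceMatcher(None, a, b).ratio()

# Invented placeholder sentences (NOT taken from the hackers' messages).
original = "we will send you the data tomorrow morning"
round_trip_a = "we will send you the data tomorrow in the morning"
round_trip_b = "tomorrow morning the data is sent to you by us"

for label, text in [("A", round_trip_a), ("B", round_trip_b)]:
    print(f"round trip {label}: {closeness(original, text):.2f}")
```

A ratio like this measures surface overlap only, so by itself it says nothing about why two texts resemble each other.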

It is important to note that nowhere in the white paper is there any citation of words, constructions, or sentences in Russian, Korean, Mandarin, German, or any other language to match the errors and infelicities of the hackers' English. All of these errors and infelicities are described only in English. Consequently, it is difficult to fully and confidently assess the claims that are being made.

So, when it comes right down to it, what do we make of the Taia Global assertion that the most likely language of the Sony hackers was Russian?

In "New Study May Add to Skepticism Among Security Experts That North Korea Was Behind Sony Hack" (NYT — 12/24/14), Nicole Perlroth observes:

…Taia Global’s sample size is small. Similar computerized attempts to identify authorship, such as JStylo, a computerized software tool, requires 6,500 words of available writing samples per suspect to make an accurate finding. In this case, hackers left less than 2,000 words between their emails and online posts.

It is also worth noting that other private security researchers say their own research backs up the government’s claims. CrowdStrike, a California security firm that has been tracking the same group that attacked Sony since 2006, believes they are located in North Korea and have been hacking targets in South Korea for years.

Shlomo Argamon, the chief scientist at Taia Global, does good work overall, though I wonder in this case if the evidence is sufficient to justify the conclusions. At this point, I should interject that the comments on the previous Language Log post about the Sony hacking (cited above) were both intensive and extensive. They covered all the main issues that have been raised in the debate over the language and the identity of the hackers, and they provided numerous links to the best studies on the identity of the Sony hackers. Shlomo Argamon was a participant in those discussions.

I asked several trusted colleagues their opinion on this vexed question.

Here is the response of a specialist on German and East Asian languages (Chinese, Japanese, and Korean, especially the latter):

Well, I never had the impression that there was anything Korean about the mistakes, but I certainly don’t claim any expertise in the kind of analysis a couple of those links you sent seem to have engaged in. I’m still not sure why they ruled out some deliberate distortion by people who actually knew English far better than it appeared, but their checking with Slavic linguists sounded like a smart way to go, don’t you think?

From a colleague who is both a trained computer cryptographer and a Chinese dialectologist:

The OED lists examples of "stylometry" going back to 1945. It's the same type of research as the "fingerprinting" that I used in studying deep historical affiliation of samples of regional Chinese, accent ID of Mandarin varieties, and parts of speech in written Chinese. These days I am working in an "ad-tech", which uses exactly the same principle to identify users on the Internet in order to hold auctions for ads to appear on their smartphones and browsers. You can see an example of data collected for the last of these applications — browser-fingerprinting — at Panopticlick.

But I will say this: just because a piece of code contains more strings that can be fingerprinted as originating with a Russian speaker than with speakers of other languages doesn't tell you anything about who is paying the programmers' salaries.

Here is the opinion of a colleague who is a linguist working on Asian languages and who is also well informed about computer analysis of texts:

I'm no expert on stylometry, but well-known cases of accepted results of which I am aware involve determining the authorship of texts of uncertain origin on the basis of statistical comparisons with texts of known origin. Because of the methodology, sample size is a critical factor. In the Sony hacking case, the situation is very different:

It is assumed

1a. that an L1 speaker will produce certain L2 errors that someone with a different L1 (including L1=L2) will only produce at significantly different frequencies (e.g., with p <0.05%);

1b. that the proportion of such distinctive errors over all errors is large enough to determine the L1 of the writer; and

1c. that there is no significant variation among L1 writers with respect to the distinctive errors they make.

2a. that the total amount of target text is large enough to do a valid statistical analysis of the kind required;

2b. that there was only one author or, if more than one, only one L1 involved;

2c. that no errors in the target texts were purposely introduced to conceal identities;

2d. that no errors were the result of machine-translation software; and

2e. that either no control text is needed for comparison or that a corrected version of the target text constitutes a valid basis for comparison.

For such reasons, I find the claim that the hackers were more likely to be native Russian speakers rather than native Korean speakers highly dubious.

Anyway, even if they were Russian speakers, so what? Were they employed by North Korea? Were they assigned to the job by Putin? Were they Russian nerds off on a lark? No one has proven (as far as I know) that it wasn't an inside job carried out by a vengeful (ex-)employee, or someone out to take down rivals or make space above him/her on the corporate ladder. Surely understanding motives is more important than "stylometry."

Given that Sony is a Japanese company, I'd say that the hypothesis that North Korea is responsible is still the best, even if they got some outside help.

After reviewing all of the above, together with the previous Language Log post and the comments thereto, my view is that North Koreans were involved in the hacking, but that they may have received assistance from Chinese and / or Russian cyber specialists. I do not believe that the Taia Global report, in and of itself as I have seen it, provides conclusive linguistic evidence that Russians were the main perpetrators.

[Thanks to Jim Unger, Bob Ramsey, David Branner, Geoff Pullum, and Ben Zimmer]

13 Comments

Nick said,

January 11, 2015 @ 12:34 am

Implicit in the list of assumptions is the possibility that the L2 of North Korean hackers is Russian, and that English is perhaps L3. I can't imagine that stylometry is capable of sorting out the confounding effects of multiple language study.
Peter Taylor said,

January 11, 2015 @ 4:09 am

As a side note, the quote which refers to a "digital finger" has me wondering whether there's any other kind.
Robot Therapist said,

January 11, 2015 @ 4:25 am

I would not expect that stylometry would work if the writer was deliberately trying to mislead as to their native language
Martin Ball said,

January 11, 2015 @ 6:25 am

@Peter Taylor, off-topic but we once had to refer to a strange sound on a recording of disordered speech as a 'digital portal tap' ;)
AB said,

January 11, 2015 @ 10:10 am

Your cryptographer colleague makes a good point. The person or people who did the deed may have no idea whom they are working for, and vice versa.
Marek said,

January 11, 2015 @ 10:50 am

The discussion that ensued in the comments during the previous post has more or less convinced me the most plausible source for any errors was machine translation (see: point 2d in your colleague's report), rendering the entire point moot. Which makes it odd that the press has largely ignored this option.

It also raises an interesting possibility that widely-available machine translation services like Google Translate may double as an anonymizing service, some sort of (artificial) standard for language, not traceable to a particular author.
Shlomo Argamon said,

January 11, 2015 @ 3:10 pm

Victor, thank you very much for this discussion of the issues, and your careful consideration and critique of our work. I will write a more comprehensive response to the issues you and others raise in this post soon, when I have the time, but I wanted to briefly respond to some of the key points.

1. The main implication of our work is that there is good reason to doubt a simple attribution of the hack to the North Koreans. If the hackers were Russian, then that fact needs to be explained even though they could have been working for NK.

2. We do not rule out the possibility that the L1 was Korean – our results suggest though that Russian is more likely than Korean. The analysis we have done so far is not comprehensive, nor do we claim it is (I apologize for the misleading too-definitive title of the report, which has since been changed). But the analysis does cast serious doubt on the idea that the message writers were North Korean.

3. If we assume L1 interference to be an influence, it would seem unlikely for an L2 to have more influence on an L3 text, than the L1, unless the L2 is very similar to the L3.

4. Your correspondent's assumption 1c is not needed – all that is needed is that there are overall consistent error patterns for different L1s, which phenomenon is pretty well documented. We do document that we assume that all of the texts are from writers with the same L1. We are currently looking at ways of testing this assumption.

5. The text is indeed not large enough to make any rigorous or semi-rigorous statistical estimates of probabilities, but the analysis gives a pretty clear qualitative ordering of the likelihoods of the several possibilities we considered.

6. We looked at the possibility that translation software was used – Google Translate at least does not seem strong enough to have been used to produce any of the texts as a whole, and the errors do not seem similar.

7. It seems unlikely that explicit error introduction would produce so many consistencies with a particular candidate L1, unless the writers were linguistically savvy and considered the use of this sort of attribution technique. Maybe, but that seems rather unlikely, as linguistic attribution has not previously been used in this sort of case.

We are currently starting a second, more extensive analysis of the texts, and would welcome any comments or help from the linguistic community.
Adrian said,

January 11, 2015 @ 7:22 pm

Robot: As far as I'm aware, it's not that easy to pretend your L1 is something else.

Marek: Google Translate has many quirks that can give clues to what the original language was.
Simon P said,

January 12, 2015 @ 5:57 am

Peter Taylor said: "As a side note, the quote which refers to a "digital finger" has me wondering whether there's any other kind."

A manual finger?
Marek said,

January 12, 2015 @ 8:01 am

@Adrian: I ran the original hacker message through Google Translate (translating it into a number of languages and then back into English), and, I admit, it *does* match the Russian re-translation very closely, while the Korean (and other) re-translations are pretty far off.

BUT (and this is a pretty big but): surely an English-speaking Korean author could have just machine-translated the original (grammatically correct) English message into Russian, then back, thus simulating L2 errors of a Russian speaker?
Eric said,

January 12, 2015 @ 10:40 am

Re: Shlomo: in your comment, point #3 *may* be exactly the wrong assumption. Though work on learners of L3s (third and fourth and fifth etc. languages) is rather limited, there is some evidence that, when learning an L3, interference from L2 can be even stronger than interference from L1. Still, I tend to share Nick's skepticism about stylometry's ability to untie the varied confounding influences of multiple languages on a single text.
Jongseong Park said,

January 12, 2015 @ 11:52 am

Has the same sort of stylometric analysis been performed on English-language samples of similar length written by native speakers of Korean, Russian, and other languages? In other words, have these methods been tested to be able to determine the L1 of the writers where this was independently verifiable?

I ask because from the descriptions given here, the analysis seems to rely on certain assumptions about how and what kinds of L1 interference would show up without justifying why that would be so.

In his comments to the previous Language Log post, Shlomo Argamon seemed to think that Korean L1 interference would simply consist of features of Korean inadvertently showing up, such as SOV word order. This goes strongly against my intuition about how native speakers of Korean with a fairly advanced command of English construct sentences in English, at least at the syntactic level. I've been looking at some examples from The Gachon Learner Corpus, and though most of these display a lower degree of proficiency in English, their errors for the most part cannot be easily attributed to simple retention of Korean language features. If I perform literal word-for-word translations on problematic phrases, I'm usually at a loss, for the simple reason that Korean and English word orders are so completely different that word-for-word translations are simply not viable without pretty extensive revisions of word order.

So when I read that linguists conducted literal word-for-word translations of the hackers' messages in Korean and tried to identify which phrases were literal Korean translations, I immediately have more questions about what exactly they did. Any chance that we'll be given an example of what exactly counts as a literal translation from Korean to English?
Jongseong Park said,

January 12, 2015 @ 12:04 pm

Is it possible to have a concrete example of how the stylometric analysis described would correctly identify samples in English produced by native speakers of Korean in preference to other languages, for example in The Gachon Learner Corpus?
