Language Log

Archive for Computational linguistics

Oxford-NINJAL Corpus of Old Japanese

April 11, 2018 @ 7:33 pm· Filed by Victor Mair under Announcements, Computational linguistics

From Bjarke Frellesvig (University of Oxford), Stephen Wright Horn (NINJAL), and Toshinobu Ogiso (NINJAL):

[VHM: NINJAL = National Institute for Japanese Language and Linguistics]

We are very pleased to announce the first public release of the
Oxford-NINJAL Corpus of Old Japanese (ONCOJ). We will be grateful if you
would circulate and share this information as appropriate.

The corpus is available through this website: http://oncoj.ninjal.ac.jp/

Read the rest of this entry »

Permalink Comments (4)

Alexa laughs

March 8, 2018 @ 11:11 am· Filed by Mark Liberman under Computational linguistics, Elephant semifics

Now that speech technology is good enough that voice interaction with devices is becoming widespread and routine, success has created a new problem: How should a device tell when to attend to ambient sounds and try to interpret them as questions or commands?

One solution is to require a mouse click or a finger press to start things off — but this can degrade the whole "ever-attentive servant" experience. So increasingly such systems rely on a key phrase like "Hey Siri" or "OK Google" or "Alexa". But this solution brings up other problems, since users don't like the idea of their entire life's soundtrack streaming to Apple or Google or Amazon. And anyhow, streaming everything to the Mother Ship might strain battery life and network bandwidth for some devices. The answer: Create simple, low-power device-local programs that do nothing but monitor ambient audio for the relevant magic phrase.

Problem: these programs aren't yet very good. Result: lots of false positives. Mostly the false positives are relatively benign — see e.g. "Annals of helpful surveillance", 5/9/2017. But recently, many people have been creeped out by Alexa laughing at them, apparently for no reason:

Read the rest of this entry »

Permalink Comments (21)

ASR error joke of the week

February 28, 2018 @ 6:59 pm· Filed by Mark Liberman under Computational linguistics, Humor

I suspect that this is just as unfair as the old ASR elevator in Scotland skit was, but I don't have time to try it out.

Permalink Comments (7)

Hearing interactions

February 28, 2018 @ 6:01 am· Filed by Mark Liberman under Computational linguistics, Psycholinguistics

Listen to this 3-second audio clip, and think about what you hear:

Read the rest of this entry »

Permalink Comments (28)

Flip Donkey Doodleplunk?

February 22, 2018 @ 6:33 am· Filed by Mark Liberman under Computational linguistics, Language and the law

Barton Beebe & Jeanne Fromer, "Are We Running Out of Trademarks? An Empirical Study of Trademark Depletion and Congestion", Harvard Law Review, February 2018:

Abstract: American trademark law has long operated on the assumption that there exists an inexhaustible supply of unclaimed trademarks that are at least as competitively effective as those already claimed. This core empirical assumption underpins nearly every aspect of trademark law and policy. This Article presents empirical evidence showing that this conventional wisdom is wrong. The supply of competitively effective trademarks is, in fact, exhaustible and has already reached severe levels of what we term trademark depletion and trademark congestion. We systematically study all 6.7 million trademark applications filed at the U.S. Patent and Trademark Office (PTO) from 1985 through 2016 together with the 300,000 trademarks already registered at the PTO as of 1985. We analyze these data in light of the most frequently used words and syllables in American English, the most frequently occurring surnames in the United States, and an original dataset consisting of phonetic representations of each applied-for or registered word mark included in the PTO’s Trademark Case Files Dataset. We further incorporate data consisting of all 128 million domain names registered in the .com top-level domain and an original dataset of all 2.1 million trademark office actions issued by the PTO from 2003 through 2016. These data show that rates of word-mark depletion and congestion are increasing and have reached chronic levels, particularly in certain important economic sectors. The data further show that new trademark applicants are increasingly being forced to resort to second-best, less competitively effective marks. Yet registration refusal rates continue to rise. The result is that the ecology of the trademark system is breaking down, with mounting barriers to entry, increasing consumer search costs, and an eroding public domain. 
In light of our empirical findings, we propose a mix of reforms to trademark law that will help to preserve the proper functioning of the trademark system and further its core purposes of promoting competition and enhancing consumer welfare.

Read the rest of this entry »

Permalink Comments (21)

Alexa disguises her name?

February 5, 2018 @ 9:04 am· Filed by Mark Liberman under Computational linguistics, Language and the media

"Alexa Loses Her Voice" won USA Today's Super Bowl Ad Meter:

I believe that this was also the first Super Bowl ad to raise a technical question about speech technology.

Read the rest of this entry »

Permalink Comments (9)

Adversarial attacks on modern speech-to-text

January 30, 2018 @ 8:56 am· Filed by Max Little under Computational linguistics, Elephant semifics

Generating adversarial STT examples.

In a recent post on this blog, Mark Liberman raised the lively area of so-called "adversarial" attacks on modern machine learning systems. These attacks can do amusing and somewhat frightening things, such as forcing an object recognition algorithm to identify all images as toasters with remarkably high confidence. Seeing these applied to image recognition, he hypothesized that they could also be applied to modern speech recognition (STT, or speech-to-text) based on e.g. deep learning. His hypothesis has indeed been recently confirmed.

Read the rest of this entry »

Permalink Comments (7)

Ross Macdonald: lexical diversity over the lifespan

January 13, 2018 @ 6:17 pm· Filed by Yves Schabes under Computational linguistics, Psychology of language

This post is an initial progress report on some joint work with Mark Liberman. It's part of a larger effort to replicate and extend Xuan Le, Ian Lancashire, Graeme Hirst, & Regina Jokel, "Longitudinal detection of dementia through lexical and syntactic changes in writing: a case study of three British novelists", Literary and Linguistic Computing 2011. Their abstract:

We present a large-scale longitudinal study of lexical and syntactic changes in language in Alzheimer's disease using complete, fully parsed texts and a large number of measures, using as our subjects the British novelists Iris Murdoch (who died with Alzheimer's), Agatha Christie (who was suspected of it), and P.D. James (who has aged healthily). […] Our results support the hypothesis that signs of dementia can be found in diachronic analyses of patients’ writings, and in addition lead to new understanding of the work of the individual authors whom we studied. In particular, we show that it is probable that Agatha Christie indeed suffered from the onset of Alzheimer's while writing her last novels, and that Iris Murdoch exhibited a ‘trough’ of relatively impoverished vocabulary and syntax in her writing in her late 40s and 50s that presaged her later dementia.
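
For readers unfamiliar with how lexical diversity is quantified, here is a minimal sketch in Python of one common measure, the type-token ratio (distinct words divided by total words), averaged over fixed-size windows so that texts of different lengths can be compared. This is an illustration only: the function names, the crude regex tokenizer, and the 1,000-token default window are assumptions, not the measures used by Le et al. or in this post.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes).

    Assumption: a crude regex tokenizer, good enough for illustration only.
    """
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Distinct word types divided by total tokens; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def windowed_ttr(tokens, window=1000):
    """Mean type-token ratio over consecutive non-overlapping windows.

    Plain TTR falls as a text gets longer, so fixed-size windows make
    novels of different lengths comparable. Texts shorter than one
    window fall back to the plain ratio.
    """
    if len(tokens) < window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(0, len(tokens) - window + 1, window)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    sample = tokenize("The cat saw the other cat")
    print(type_token_ratio(sample))        # 4 types over 6 tokens
    print(windowed_ttr(sample, window=3))  # two windows of 3 tokens each
```

Computing such a score for each novel and plotting it against the author's age at publication gives the kind of lifespan curve that a longitudinal study of this sort examines.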

Read the rest of this entry »

Permalink Comments (11)

News program presenter meets robot avatar

December 31, 2017 @ 9:56 am· Filed by Geoffrey K. Pullum under Computational linguistics, Language and computers, Linguistics in the comics, Linguistics in the news, Speech technology

Yesterday BBC's Radio 4 program "Today", the cultural counterpart of NPR's "Morning Edition", invited into the studio a robot from the University of Sheffield, the Mishalbot, which had been trained to conduct interviews by exposure to the on-air speech of co-presenter Mishal Husain. They let it talk for three minutes with the real Mishal. (video clip here, at least for UK readers; may not be available in the US). Once again I was appalled at the credulity of journalists when confronted with AI. Despite all the evidence that the robot was just parroting Mishalesque phrases, Ms Husain continued with the absurd charade, pretending politely that her robotic alter ego was really conversing. Afterward there was half-serious on-air discussion of the possibility that some day the jobs of the Today program presenters and interviewers might be taken over by robots.

The main thing differentiating the Sheffield robot from Joseph Weizenbaum's ELIZA program of 1966 (apart from a babyish plastic face and movable fingers and eyes, which didn't work well on radio) was that the Mishalbot is voice-driven (with ELIZA you had to type on a terminal). So the main technological development has been in speech recognition engineering. On interaction, the Mishalbot seemed to me to be at sub-ELIZA level. "What do you mean? Can you give an example?" it said repeatedly, at various inappropriate points.

Read the rest of this entry »

Permalink Comments off

A virus that fixes your grammar

December 8, 2017 @ 5:16 am· Filed by Geoffrey K. Pullum under Computational linguistics, Grammar, Information technology, Language and the media, Usage advice

In today's Dilbert strip, Dilbert is confused by why the company mission statement looks so different, and Alice diagnoses what's happened: the Elbonian virus that has been corrupting the company's computer systems has fixed all the grammar and punctuation errors it formerly contained.

That'll be the day. Right now, computational linguists with an unlimited budget (and unlimited help from Elbonian programmers) would be unable to develop a trustworthy program that could proactively fix grammar and punctuation errors in written English prose. We simply don't know enough. The "grammar checking" programs built into word processors like Microsoft Word are dire, even risible, catching only a limited list of shibboleths and being wrong about many of them. Flagging split infinitives, passives, and random colloquialisms as if they were all errors is not much help to you, especially when many sequences are flagged falsely. Following all of Word's suggestions for changes would create gibberish. Free-standing tools like Grammarly are similarly hopeless. They merely read and note possible "errors", leaving you to make corrections. They couldn't possibly be modified into programs that would proactively correct your prose. Take the editing error in this passage, which Rodney Huddleston recently noticed in a quality newspaper, The Australian:

There has been no glimmer of light from the Palestinian Authority since the Oslo Accords were signed, just the usual intransigence that even the wider Arab world may be tiring of. Yet the West, the EU, nor the UN, have never made the PA pay a price for its intransigence.

Read the rest of this entry »

Permalink Comments off

Woo

November 24, 2017 @ 9:18 am· Filed by Mark Liberman under Computational linguistics

I accidentally texted my wife with voice recognition…while playing the trombone pic.twitter.com/tWCPSXbbrO

— Paul The Trombonist (@JazzTrombonist) November 21, 2017

Read the rest of this entry »

Permalink Comments (10)

Linguistic Science and Technology in China

November 12, 2017 @ 8:33 am· Filed by Mark Liberman under Computational linguistics, Language and politics

I just spent a few days in China, mainly to attend an "International Workshop on Language Resource Construction: Theory, Methodology and Applications". This was the second event in a three-year program funded by a small grant from the "Penn China Research & Engagement Fund". That program's goals include "To develop new, or strengthen existing, institutional and faculty-to-faculty relationships with Chinese partners", and our proposal focused on "linguistic diversity in China, with specific emphasis on the documentation of variation in standard, regional and minority languages".

After last year's workshop at the Penn Wharton China Center, some Chinese colleagues (Zhifang Sui and Weidong Zhan from the Key Laboratory of Computational Linguistics and the Center for Chinese Linguistics at Peking University) suggested that we join them in co-sponsoring a two-day workshop this fall, with the first day at PKU and the second day at the PWCC. Here's the group photo from the first day (11/5/2017):

The growing strength of Chinese research in the various areas of linguistic science and technology has been clear for some time, and the presentations and discussions at this workshop made it clear that this work is poised for a further major increase in quantity and quality.

Read the rest of this entry »

Permalink Comments (11)

You need to know something

October 25, 2017 @ 4:12 am· Filed by Mark Liberman under Computational linguistics, Elephant semifics

I'm happy to see that Google Translate is still turning (many types of) meaningless character sequences into spoken-word poetry. Repetitions of single hiragana characters are an especially reliable source — here's "You need to know something":

Read the rest of this entry »

Permalink Comments (15)
