Archive for Computational linguistics

English as Afrikaans?

Language identification from digital text has been a solved problem for many years, so I was surprised yesterday to see Gmail offering to translate from Afrikaans an email written in perfectly idiomatic English, which started this way:

Deep fake audio

Helen Rosner, "A Haunting New Documentary About Anthony Bourdain", The New Yorker 7/15/2021:

It’s been three years since Anthony Bourdain died, by suicide, in June of 2018, and the void he left is still a void. […]

In 2019, about a year after Bourdain’s death, the documentary filmmaker Morgan Neville began talking to people who had been close to Bourdain: his family, his friends, the producers and crew of his television series. “These were the hardest interviews I’ve ever done, hands down,” he told me. “I was the grief counsellor, who showed up to talk to everybody.” […]

There is a moment at the end of the film’s second act when the artist David Choe, a friend of Bourdain’s, is reading aloud an e-mail Bourdain had sent him: “Dude, this is a crazy thing to ask, but I’m curious,” Choe begins reading, and then the voice fades into Bourdain’s own: “. . . and my life is sort of shit now. You are successful, and I am successful, and I’m wondering: Are you happy?” I asked Neville how on earth he’d found an audio recording of Bourdain reading his own e-mail. Throughout the film, Neville and his team used stitched-together clips of Bourdain’s narration pulled from TV, radio, podcasts, and audiobooks. “But there were three quotes there I wanted his voice for that there were no recordings of,” Neville explained. So he got in touch with a software company, gave it about a dozen hours of recordings, and, he said, “I created an A.I. model of his voice.” In a world of computer simulations and deepfakes, a dead man’s voice speaking his own words of despair is hardly the most dystopian application of the technology. But the seamlessness of the effect is eerie. “If you watch the film, other than that line you mentioned, you probably don’t know what the other lines are that were spoken by the A.I., and you’re not going to know,” Neville said. “We can have a documentary-ethics panel about it later.”

Publication penalties

Amanda D'Ambrosio, "Mayo Physician Fired Over COVID Book", MedPage Today 6/24/2021:

After publishing a book about his experience on the front lines during the COVID-19 pandemic, a physician was fired from his position at the Mayo Clinic this month, he confirmed to MedPage Today.

Steven Weiss, MD, an internist who practiced at the clinic's Eau Claire, Wisconsin location for 32 years, stated that he was terminated because he identified himself as an employee of the Mayo Clinic in his new book, called Carnage in America: COVID-19, Racial Injustice, and the Demise of Donald Trump.

According to a June 4 termination letter shared with MedPage Today, Mayo Clinic administrators told Weiss, 62, that his actions violated the health system's publishing policy, as he did not submit his manuscript to the institution for review before it was printed.

"I'm still in shock that I was terminated for this," Weiss said in an interview with MedPage Today. "I had no idea that they would claim a right to pre-vet a book before publication."

Stochastic parrots

Long, but worth reading — Tom Simonite, "What Really Happened When Google Ousted Timnit Gebru", Wired 6/8/2021.

The crux of the story is this paper, which is now available on the ACM's website: Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell, "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?🦜" In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610-623. 2021.

As a result of a (somewhat strange) review process, described at length in the Wired article, Timnit Gebru and Margaret Mitchell were fired (or declared to have resigned) from their leadership roles in Google's Ethical AI group.

"This massive monster of incomprehensibility"

Atul Gawande, "Why doctors hate their computers", 11/5/2018, underlines the often-noted difficulty of working with badly designed software:

I’ve come to feel that a system that promised to increase my mastery over my work has, instead, increased my work’s mastery over me. I’m not the only one. A 2016 study found that physicians spent about two hours doing computer work for every hour spent face to face with a patient—whatever the brand of medical software. In the examination room, physicians devoted half of their patient time facing the screen to do electronic tasks. And these tasks were spilling over after hours. 

But the most interesting part of the article, at least for me, was the discussion of reading the records rather than writing them.

Meta-methodology

Advances in topic modeling

In the middle to late 1990s, "Topic Detection and Tracking" was an active research area (see also this). And by the early 2000s, the technology was good enough to support the creation of Google News. Twenty years later, these and other innovations have transformed the mass media, for good or ill. I don't know what algorithms the AI in charge of Topic Modeling at Google News is using these days, but I'm happy to see it developing a sense of humor:

The 17th annual Blizzard Challenge

In today's email, an announcement for the 17th annual Blizzard Challenge:

We are delighted to call for participation in the Blizzard Challenge 2021. This is an open evaluation of corpus-based speech synthesis systems using common datasets and a large listening test.

This year, the challenge will provide a European Spanish speech dataset from one native speaker. The dataset was offered by iFLYTEK Co. Ltd. and is now available for downloading after registration and completing the license.

The two tasks involve building voices from this data to synthesise texts containing only Spanish words and to synthesise Spanish texts containing a small number of English words in each sentence.

Please read the full announcement and the rules at:

http://www.synsig.org/index.php/Blizzard_Challenge_2021

Please register by following the instructions on the web page, then wait for your registration to be accepted before completing the data license.

Important: please send all communications about Blizzard to the official address blizzard@festvox.org and not to our personal addresses.

Please feel free to distribute this announcement to other relevant mailing lists.

Regards,
Zhenhua Ling & Simon King

steering committee: Alan Black, Keiichi Tokuda, Simon King

Ted Cruz in big trouble

Ben Hull writes:

In our Computational Linguistics class we were discussing different methods of segmenting Chinese character texts. Today I came across a terrific example of the problems of segmenting left to right, in the first sentence of the attached image. I hope you find it as amusing as I did.
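For readers who haven't met the left-to-right problem, here is a minimal sketch of greedy forward maximum matching, one common left-to-right segmentation method. It is not necessarily the method discussed in the class, and the toy lexicon and test phrase are my own assumptions, not taken from the attached image. At each position the segmenter grabs the longest lexicon entry it can find, and never reconsiders.

```python
def forward_max_match(text, lexicon):
    """Greedy left-to-right segmentation: at each position, take the
    longest lexicon entry; fall back to a single character."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源"}

# Intended reading: 研究 / 生命 / 起源 ("research the origin of life").
segments = forward_max_match("研究生命起源", LEXICON)
print(" / ".join(segments))  # 研究生 / 命 / 起源
```

Because 研究生 ("graduate student") is longer than 研究 ("research"), the greedy pass commits to it and strands 命 on its own. Right-to-left matching or a statistical segmenter would be needed to recover the intended reading.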

Data vs. information

[This is a guest post by Conal Boyce]

The following was drafted as an Appendix to a project whose working title is "The Emperor's New Information" (after Penrose, The Emperor's New Mind). It's still a work-in-progress, so feedback would be welcome. For example: Are the two examples persuasive? Do they need technical clarification or correction? Have others at LL noticed how certain authors "who should know better" use the term information where data is dictated by the context, or employ the two terms at random, as if they were synonyms?

MuRIL

[Note that the "To view or add a comment" message is from LinkedIn, not LLOG…]

A Real Character, and a Philosophical Language

A couple of decades ago, in response to a long-forgotten taxonomic proposal, I copied into antique HTML Jorge Luis Borges' essay "El Idioma Analítico de John Wilkins", along with an English translation. This afternoon, a reading-group discussion about algorithms for topic classification brought up the idea of a single universal tree-structured taxonomy of topics, and this reminded me again of what Borges had to say about Wilkins' 1668 treatise "An Essay Towards a Real Character, And a Philosophical Language". You should read the whole of Borges' essay, but the relevant passage for computational taxonomists is this:

[N]otoriamente no hay clasificación del universo que no sea arbitraria y conjetural. La razón es muy simple: no sabemos qué cosa es el universo. "El mundo – escribe David Hume – es tal vez el bosquejo rudimentario de algún dios infantil, que lo abandonó a medio hacer, avergonzado de su ejecución deficiente; es obra de un dios subalterno, de quien los dioses superiores se burlan; es la confusa producción de una divinidad decrépita y jubilada, que ya se ha muerto" (Dialogues Concerning Natural Religion, V. 1779). Cabe ir más lejos; cabe sospechar que no hay universo en el sentido orgánico, unificador, que tiene esa ambiciosa palabra. Si lo hay, falta conjeturar su propósito; falta conjeturar las palabras, las definiciones, las etimologías, las sinonimias, del secreto diccionario de Dios.

[I]t is clear that there is no classification of the Universe that is not arbitrary and full of conjectures. The reason for this is very simple: we do not know what thing the universe is. "The world – David Hume writes – is perhaps the rudimentary sketch of a childish god, who left it half done, ashamed by his deficient work; it is created by a subordinate god, at whom the superior gods laugh; it is the confused production of a decrepit and retiring divinity, who has already died" ('Dialogues Concerning Natural Religion', V. 1779). We are allowed to go further; we can suspect that there is no universe in the organic, unifying sense, that this ambitious term has. If there is a universe, its aim is not conjectured yet; we have not yet conjectured the words, the definitions, the etymologies, the synonyms, from the secret dictionary of God.

Interview with Charles Yang

Charles Yang* is perhaps best known for the development of the Tolerance Principle, a way to quantify and predict (given some input) whether a rule will become productive. He is currently Professor of Linguistics at the University of Pennsylvania, where he collaborates with various researchers around the world to test and extend the Tolerance Principle and gain greater insight into the mechanisms underlying language acquisition.
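As a rough illustration of what "quantify and predict" means here: in its usual statement, the Tolerance Principle says that a rule applying to N items is productive only if its exceptions number at most N / ln N. The sketch below simply computes that threshold. The function names are hypothetical, and the figures are made-up examples rather than data from any study.

```python
import math


def tolerance_threshold(n):
    """Largest number of exceptions a rule over n items can tolerate,
    under the usual statement of the principle: n / ln(n)."""
    if n < 2:
        raise ValueError("this sketch defines the threshold only for n >= 2")
    return n / math.log(n)


def is_productive(n, exceptions):
    """True if the rule is predicted to be productive for n items."""
    return exceptions <= tolerance_threshold(n)


# Illustrative numbers only: a rule covering 100 items.
print(round(tolerance_threshold(100), 2))  # 21.71
print(is_productive(100, 21))              # True
print(is_productive(100, 22))              # False
```

Note how slowly the threshold grows: a rule over 100 items tolerates about 21 exceptions, so small vocabularies can absorb proportionally more exceptions than large ones.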

 

How did you get into Computational Linguistics?

I’ve always been a computer scientist, I never really took any linguistics classes and I was interested in compilers. I was doing AI, so it was kind of natural to think about how human languages were parsed. I remember going to the library looking for stuff like this and I stumbled onto the book “Principle Based Parsing” which was an edited volume and it was incomprehensible. It was fascinating, actually, I wrote [Noam] Chomsky a physical letter way back in the day when I was a kid in Ohio and he kindly replied and said things like there’s recent work in syntax and so on. That was one of the reasons I applied to MIT to do computer science because I was attracted to the work of Bob Berwick who was the initiator of principle based parsing at the time. While doing that, I also ran across Mitch Marcus’s book. I don’t think I quite understood everything he was saying there but his idea of deriving syntactic constraints from parsing strategies was very interesting. I started reading Lectures on Government & Binding among other things. I applied to MIT, I got in. I had some marginal interests in vision, I was very attracted to Shimon Ullman’s work on the psychophysical constraints of vision. [It was] very much out of the Marrian program as opposed to what was beginning to become common, which was this image processing based approach to vision which was just applied data analysis which didn’t quite interest me as much.
