Archive for Computational linguistics

Interview with Charles Yang

Charles Yang* is perhaps best known for the development of the Tolerance Principle, a way to quantify and predict (given some input) whether a rule will become productive. He is currently Professor of Linguistics at the University of Pennsylvania, where he collaborates with various researchers around the world to test and extend the Tolerance Principle and gain greater insight into the mechanisms underlying language acquisition.
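
For readers who haven't seen it stated, the Tolerance Principle (in its standard formulation) says that a rule applying to N lexical items remains productive only if it has at most N/ln N exceptions. A toy calculation with made-up numbers:

from math import log

def tolerance_threshold(n):
    """Maximum number of exceptions a rule over n items can tolerate: n / ln n."""
    return n / log(n)

def is_productive(n_items, n_exceptions):
    """Tolerance Principle: the rule is productive iff exceptions <= n / ln n."""
    return n_exceptions <= tolerance_threshold(n_items)

# Made-up numbers: a rule covering 120 verbs, 20 of which are exceptions.
print(round(tolerance_threshold(120), 1))   # 25.1
print(is_productive(120, 20))               # True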

How did you get into Computational Linguistics?

I’ve always been a computer scientist; I never really took any linguistics classes, and I was interested in compilers. I was doing AI, so it was kind of natural to think about how human languages were parsed. I remember going to the library looking for stuff like this, and I stumbled onto the book “Principle-Based Parsing”, which was an edited volume, and it was incomprehensible. It was fascinating, actually. I wrote [Noam] Chomsky a physical letter way back in the day, when I was a kid in Ohio, and he kindly replied and said things like there’s recent work in syntax and so on. That was one of the reasons I applied to MIT to do computer science: I was attracted to the work of Bob Berwick, who was the initiator of principle-based parsing at the time. While doing that, I also ran across Mitch Marcus’s book. I don’t think I quite understood everything he was saying there, but his idea of deriving syntactic constraints from parsing strategies was very interesting. I started reading Lectures on Government and Binding, among other things. I applied to MIT, and I got in. I had some marginal interests in vision; I was very attracted to Shimon Ullman’s work on the psychophysical constraints of vision. [It was] very much out of the Marrian program, as opposed to what was beginning to become common: the image-processing-based approach to vision, which was just applied data analysis and didn’t quite interest me as much.

Read the rest of this entry »

Comments (1)

Purchase wine, buy beer

30 years ago, Don Hindle explored the idea of calculating semantic similarity on the basis of predicate-argument relations in text corpora, and in the context of that work, I remember him noting that we tend to purchase wine but buy beer. He didn't have a lot of evidence for that insight, since he was working with a mere six-million-word corpus of Associated Press news stories, in which the available counts were small:

            wine   beer
purchase       1      0
buy            0      3

So for today's lecture on semantics for ling001, I thought I'd check the counts in one of the larger collections available today, as an example of the weaker types of connotational meaning.
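
For anyone who wants to replicate this sort of count, here's a minimal sketch: a crude regular-expression search over a plain-text corpus file, which is only an approximation of the predicate-argument relations Hindle was actually counting, and the corpus path below is just a placeholder.

import re
from collections import Counter

# Crude verb..noun co-occurrence counts in a plain-text corpus: a rough stand-in
# for the parsed predicate-argument pairs that Hindle used.
VERBS = ["purchase", "purchases", "purchased", "purchasing",
         "buy", "buys", "bought", "buying"]
NOUNS = ["wine", "beer"]
LEMMA = {"purchases": "purchase", "purchased": "purchase", "purchasing": "purchase",
         "buys": "buy", "bought": "buy", "buying": "buy"}

# Verb followed by the noun within a window of up to three intervening words.
PATTERN = re.compile(
    r"\b(" + "|".join(VERBS) + r")\b(?:\W+\w+){0,3}?\W+(" + "|".join(NOUNS) + r")\b",
    re.IGNORECASE,
)

def collocation_counts(path):
    """Count verb..noun co-occurrences, folding inflected verb forms together."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            for verb, noun in PATTERN.findall(line):
                counts[(LEMMA.get(verb.lower(), verb.lower()), noun.lower())] += 1
    return counts

if __name__ == "__main__":
    counts = collocation_counts("corpus.txt")     # placeholder corpus file
    for verb in ("purchase", "buy"):
        print(verb, {noun: counts[(verb, noun)] for noun in NOUNS})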

Read the rest of this entry »

Comments (19)

18th-century RNA research

As I was looking into the history of the term biomarker, Google Scholar reminded me that automatic information extraction from text remains imperfect:

Google Scholar's translation into APA format:

Crea, F., Watahiki, A., & Quagliata, L. (1769). Identification of a long non-coding RNA as a novel biomarker and potential therapeutic target for metastatic prostate cancer. Oncotarget 5, 764–774.

Read the rest of this entry »

Comments (9)

The computational linguistics of COVID-19 vaccine design

He Zhang, Liang Zhang, Ziyu Li, Kaibo Liu, Boxiang Liu, David H. Mathews, and Liang Huang, "LinearDesign: Efficient Algorithms for Optimized mRNA Sequence Design", arXiv.org 4/21/2020:

A messenger RNA (mRNA) vaccine has emerged as a promising direction to combat the current COVID-19 pandemic. This requires an mRNA sequence that is stable and highly productive in protein expression, features which have been shown to benefit from greater mRNA secondary structure folding stability and optimal codon usage. However, sequence design remains a hard problem due to the exponentially many synonymous mRNA sequences that encode the same protein. We show that this design problem can be reduced to a classical problem in formal language theory and computational linguistics that can be solved in O(n^3) time, where n is the mRNA sequence length. This algorithm could still be too slow for large n (e.g., n = 3,822 nucleotides for the spike protein of SARS-CoV-2), so we further developed a linear-time approximate version, LinearDesign, inspired by our recent work, LinearFold. This algorithm, LinearDesign, can compute the approximate minimum free energy mRNA sequence for this spike protein in just 11 minutes using beam size b = 1,000, with only 0.6% loss in free energy change compared to exact search (i.e., b = +infinity, which costs 1 hour). We also develop two algorithms for incorporating the codon optimality into the design, one based on k-best parsing to find alternative sequences and one directly incorporating codon optimality into the dynamic programming. Our work provides efficient computational tools to speed up and improve mRNA vaccine development.
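
One way to appreciate the "exponentially many synonymous mRNA sequences" problem: each amino acid is encoded by between one and six codons, so the number of candidate coding sequences is the product of the per-residue codon counts. A back-of-the-envelope sketch (the short peptide is made up, and this is only the counting argument, not the LinearDesign algorithm itself):

from math import prod

# Synonymous-codon counts for each amino acid in the standard genetic code.
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}

def num_synonymous_mrnas(protein):
    """Number of distinct mRNA coding sequences for a given amino-acid string."""
    return prod(DEGENERACY[aa] for aa in protein)

peptide = "MKTAYIALLSRG"   # a made-up 12-residue peptide
print(f"{len(peptide)} residues -> {num_synonymous_mrnas(peptide):,} synonymous mRNAs")

Already at 12 residues the count is in the millions, and it keeps multiplying with every additional codon, which is why brute-force search over the spike protein's coding sequences is out of the question.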

Read the rest of this entry »

Comments (1)

The Scunthorpe effect rides again

Alex Hern, "Anti-porn filters stop Dominic Cummings trending on Twitter", The Guardian 5/27/2020:

Twitter’s anti-porn filters have blocked Dominic Cummings’ name from its list of trending topics despite Boris Johnson’s chief adviser dominating British political news for almost a week, the Guardian can reveal.

As a result of the filtering, trending topics over the past five days have instead included a variety of misspellings of his name, including #cummnings, #dominiccummigs and #sackcummimgs, as well as his first name on its own, the hashtag #sackdom, and the place names Durham, County Durham and Barnard Castle.

The filter also affects suggested hashtags, meaning users who tried to type #dominiccummings were instead presented with one of the misspelled variations to auto-complete, helping them trend instead.

This sort of accidental filtering has gained a name in computer science: the Scunthorpe problem, so-called because of the Lincolnshire town’s regular issues with such censorship.

Read the rest of this entry »

Comments (15)

Mama Drama

So-called "verbal fluency" is one of the tasks we're using in the first iteration of the SpeechBiomarkers project (and please participate if you haven't done so!). Despite the test's name, it doesn't really measure verbal fluency, but rather asks subjects to name as many words as possible from some category in 60 seconds, like "animals" or "words starting with the letter F".

Here's the first ten seconds of one participant's "animals" response:

As you can hear, the audio quality is quite good, although it was recorded remotely using the participant's browser and their system's standard microphone. These days, standard hardware usually has pretty good built-in facilities for voice recording.

In order to automate the analysis, we need a speech-to-text system that can do a good enough job on data of this kind. As I've noted in earlier posts, we're not there yet for picture descriptions ("Shelties On Alki Story Forest", 11/26/2019; "The right boot of the warner of the baron", 12/6/2019). For "fluency" recordings, the error rate is worse — but maybe we're actually closer to a solution, as I'll explain below.
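
For what it's worth, here's a toy sketch of what the downstream scoring could look like once a usable transcript with word timings exists. The (word, onset) input format and the tiny animal lexicon are made up for illustration, not our actual pipeline.

# Toy scoring for an "animals in 60 seconds" fluency response.
ANIMALS = {"dog", "cat", "horse", "cow", "lion", "tiger", "elephant",
           "zebra", "giraffe", "mouse", "rabbit", "bear", "wolf", "fox"}

def fluency_score(timed_words, time_limit=60.0):
    """Count distinct animal names whose onset falls within the time limit."""
    named = []
    for word, onset in timed_words:
        w = word.lower()
        if onset <= time_limit and w in ANIMALS and w not in named:
            named.append(w)
    return len(named), named

# Example input, as it might come from a speech-to-text system with timestamps.
response = [("dog", 1.2), ("cat", 2.8), ("um", 4.0), ("dog", 5.1),
            ("elephant", 7.9), ("zebra", 11.3)]
score, named = fluency_score(response)
print(score, named)   # 4 ['dog', 'cat', 'elephant', 'zebra']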

Read the rest of this entry »

Comments (5)

Chatbot comedy

Unfortunately, most customer service chatbots are not nearly this good:

Comments (1)

Zoom time

I'm involved with several projects that analyze recordings from e-interviews conducted using systems like Zoom, Bluejeans, and WebEx. Some of our analysis techniques rely on timing information, and so it's natural to wonder how much distortion might be introduced by those systems' encoding, transmission, and decoding processes.

Why might timing in particular be distorted? Because any internet-based audio or video streaming system encodes the signal at the source into a series of fairly small packets, sends them individually by diverse routes to the destination, and then assembles them again at the end.

If the transmission is one-way, then the system can introduce a delay that's long enough to ensure that all the diversely-routed packets get to the destination in time to be reassembled in the proper order — maybe a couple of seconds of buffering. But for a conversational system, that kind of latency disrupts communication, and so the buffering delays used by broadcasters and streaming services are not possible. As a result, there may be missing packets at decoding time, and the system has to deal with that by skipping, repeating, or interpolating (the signal derived from) packets, or by just freezing up for a while.
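
Here's a toy simulation of that trade-off (not a model of any particular conferencing system): packets are generated every 20 ms, network delay is a fixed latency plus random jitter, and a packet is lost if it arrives after its scheduled playout time.

import random

def simulate(num_packets=5000, frame_ms=20.0, playout_delay_ms=80.0,
             base_latency_ms=40.0, jitter_ms=30.0, seed=0):
    """Fraction of packets that miss their playout deadline.

    Packet i is sent at i*frame_ms, arrives after a fixed base latency plus
    exponentially distributed jitter, and must arrive within playout_delay_ms
    of being sent in order to be played out on schedule.
    """
    rng = random.Random(seed)
    late = 0
    for i in range(num_packets):
        send_time = i * frame_ms
        arrival = send_time + base_latency_ms + rng.expovariate(1.0 / jitter_ms)
        if arrival > send_time + playout_delay_ms:
            late += 1
    return late / num_packets

if __name__ == "__main__":
    # Bigger playout (jitter-buffer) delays lose fewer packets, at the cost of
    # added conversational latency: the broadcast-vs-conversation trade-off.
    for delay in (50, 80, 120, 200, 2000):
        print(f"playout delay {delay:>4} ms: {simulate(playout_delay_ms=delay):6.1%} late")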

It's not clear (at least to me) how much of this happens when, or how to monitor it. (Though it's easy to see that the video signal in such conversations is often coarsely sampled or even "frozen", and obvious audio glitches sometimes occur as well.) But the results of a simple test suggest that more subtle time distortion is sometimes a problem for the audio channel as well.

Read the rest of this entry »

Comments (9)

Speaker change detection

A couple of years ago ("Hearing interactions", 2/28/2018), I posted some anecdotal evidence that human perception of speaker change is accurate and usually also pretty fast. I noted that the performance of automatic systems at analogous tasks was distinctly underwhelming in comparison.

A recent paper measures human performance more systematically, and compares it with a state-of-the-art system — Neeraj Sharma et al., "On the impact of language familiarity in talker change detection", ICASSP 2020:

The ability to detect talker changes when listening to conversational speech is fundamental to perception and understanding of multitalker speech. In this paper, we propose an experimental paradigm to provide insights on the impact of language familiarity on talker change detection. Two multi-talker speech stimulus sets, one in a language familiar to the listeners (English) and the other unfamiliar (Chinese), are created. A listening test is performed in which listeners indicate the number of talkers in the presented stimuli. Analysis of human performance shows statistically significant results for: (a) lower miss (and a higher false alarm) rate in familiar versus unfamiliar language, and (b) longer response time in familiar versus unfamiliar language. These results signify a link between perception of talker attributes and language proficiency. Subsequently, a machine system is designed to perform the same task. The system makes use of the current state-of-the-art diarization approach with x-vector embeddings. A performance comparison on the same stimulus set indicates that the machine system falls short of human performance by a huge margin, for both languages.

Comments (4)

Lexical display rates in novels

In some on-going research on linguistic features relating to clinical diagnosis and tracking, we've been looking at "lexical diversity". It's easy to measure the rate of vocabulary display — you can just use a type-token graph, which shows the count of distinct words ("types") against the count of total words ("tokens"). It's less obvious how to turn such a curve into a single number that can be compared across sources — for a survey of some alternative measures, see e.g. Scott Jarvis, "Short texts, best-fitting curves and new measures of lexical diversity", Language Testing 2002; and for the measure that we've settled on, see Michael Covington and Joe McFall, "Cutting the Gordian knot: The moving-average type–token ratio (MATTR)", Journal of Quantitative Linguistics 2010. More on that later.
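
For concreteness, here's a minimal sketch of both quantities, assuming the text has already been tokenized into words; the MATTR computation just averages the type-token ratio over every window of a fixed length, following Covington & McFall's moving-average idea, and the input file name is a placeholder.

import re

def type_token_curve(tokens):
    """Cumulative count of distinct word types after each successive token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve

def mattr(tokens, window=500):
    """Moving-average type-token ratio: mean TTR over all windows of a fixed length."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)   # fall back to plain TTR for short texts
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

if __name__ == "__main__":
    text = open("novel.txt", encoding="utf-8").read()      # placeholder file name
    tokens = [w.lower() for w in re.findall(r"[a-zA-Z']+", text)]
    curve = type_token_curve(tokens)
    for n in (1000, 10000, 50000):
        if n <= len(tokens):
            print(f"types after {n:>6} tokens: {curve[n - 1]}")
    print("MATTR (window=500):", round(mattr(tokens), 4))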

For now, I want to make a point that depends only on type-token graphs. Over time, I've accumulated a small private digital corpus of more than 100 English-language fiction titles, from Tristram Shandy forward to 2019. It's clear that different authors have different characteristic rates of vocabulary display, and for today's post, I want to present the authors in my collection with the highest and lowest characteristic rates.

Read the rest of this entry »

Comments (5)

New approaches to Alzheimer's Disease

This post is another pitch for our on-going effort to develop simple, easy, and effective ways to track neurocognitive health through short interactions with a web app.  Why do we want this? Two reasons: first, early detection of neurodegenerative disorders through near-universal tracking; and second, easy large-scale evaluation of interventions, whether those are drugs or lifestyle changes. You can participate by enrolling at https://speechbiomarkers.org, and suggesting it to your friends and acquaintances as well.

Today, diagnosis generally depends on scoring below a certain value on cognitive tests such as the MMSE, which usually won't even be given until you've started experiencing life-changing symptoms — and at that point, the degenerative process has probably been at work for a decade or more. This may well be too late for interventions to make a difference, which may help explain the failure of dozens of Alzheimer's disease drug trials. And it's difficult and expensive to evaluate an intervention, in part because it requires a series of clinic visits, making it hard to fund support for trials that don't involve a patented drug.

If people could accurately track their neurocognitive health with a few minutes a week on a web app, they could be alerted to potential problems by the rate of change in their scores, even if they're many years away from a diagnosis by today's methods. Of course, this will be genuinely useful only when we have ways to slow or reverse the process — but the same approach can be used to evaluate such interventions inexpensively on a large scale.

More background is here: "Towards tracking neurocognitive health", 3/24/2020. As that post explains, this is just the first step on what may be a long journey — but we will be making the data available to all interested researchers, so that the approaches that have worked elsewhere in AI research over the past 30 years can be applied to this problem as well.

Again, you can participate by enrolling at https://speechbiomarkers.org . And please spread the word!

Read the rest of this entry »

Comments (2)

What do you hear?

Listen to this sound, and describe it in the comments below:

You can learn what the sound is, and why I care how you hear it, after the fold.

Read the rest of this entry »

Comments (41)

Standardized Project Gutenberg Corpus

Martin Gerlach and Francesc Font-Clos, "A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics", arXiv 12/19/2018:

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potential biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient details), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3×10^9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on 3 different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.

Read the rest of this entry »

Comments (1)