Archive for Computational linguistics

Advances in topic modeling

In the middle to late 1990s, "Topic Detection and Tracking" was an active research area (see also this). And by the early 2000s, the technology was good enough to support the creation of Google News. Twenty years later, these and other innovations have transformed the mass media, for good or ill. I don't know what algorithms the AI in charge of Topic Modeling at Google News is using these days, but I'm happy to see it developing a sense of humor:

Read the rest of this entry »

The 17th annual Blizzard Challenge

In today's email, an announcement for the 17th annual Blizzard Challenge:

We are delighted to call for participation in the Blizzard Challenge 2021. This is an open evaluation of corpus-based speech synthesis systems using common datasets and a large listening test.

This year, the challenge will provide a European Spanish speech dataset from one native speaker. The dataset was offered by iFLYTEK Co. Ltd. and is now available for downloading after registration and completing the license.
The two tasks involve building voices from this data to synthesise texts containing only Spanish words and to synthesise Spanish texts containing a small number of English words in each sentence.
Please read the full announcement and the rules at:

http://www.synsig.org/index.php/Blizzard_Challenge_2021

Please register by following the instructions on the web page, then wait for your registration to be accepted before completing the data license.

Important: please send all communications about Blizzard to the official address blizzard@festvox.org and not to our personal addresses.

Please feel free to distribute this announcement to other relevant mailing lists.

Regards,
Zhenhua Ling & Simon King

steering committee: Alan Black, Keiichi Tokuda, Simon King

Read the rest of this entry »

Ted Cruz in big trouble

Ben Hull writes:

In our Computational Linguistics class we were discussing different methods of segmenting Chinese character texts. Today I came across a terrific example of the problems of segmenting left to right, in the first sentence of the attached image. I hope you find it as amusing as I did.
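For readers who have not met the problem, here is a minimal sketch of greedy left-to-right ("forward maximum matching") segmentation. The attached image is not reproduced here, so the sentence and the toy lexicon below are illustrative assumptions, not the example Ben Hull sent; the function name and the `max_len` parameter are likewise hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: at each position, take the
    longest lexicon entry that starts there."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, chosen only to show the failure mode.
LEXICON = {"研究", "研究生", "生命", "起源"}

segments = forward_max_match("研究生命起源", LEXICON)
print(segments)
```

The intended reading is 研究 / 生命 / 起源 ("study the origin of life"), but the greedy matcher grabs 研究生 ("graduate student") first and is then left with a stranded 命.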

Read the rest of this entry »

Data vs. information

[This is a guest post by Conal Boyce]

The following was drafted as an Appendix to a project whose working title is "The Emperor's New Information" (after Penrose, The Emperor's New Mind). It's still a work-in-progress, so feedback would be welcome. For example: Are the two examples persuasive? Do they need technical clarification or correction? Have others at LL noticed how certain authors "who should know better" use the term information where data is dictated by the context, or employ the two terms at random, as if they were synonyms?

Read the rest of this entry »

MuRIL

[Note that the "To view or add a comment" message is from LinkedIn, not LLOG…]

Read the rest of this entry »

A Real Character, and a Philosophical Language

A couple of decades ago, in response to a long-forgotten taxonomic proposal, I copied into antique html Jorge Luis Borges' essay "El Idioma Analítico de John Wilkins", along with an English translation. This afternoon, a reading-group discussion about algorithms for topic classification brought up the idea of a single universal tree-structured taxonomy of topics, and this reminded me again of what Borges had to say about Wilkins' 1668 treatise "An Essay Towards a Real Character, And a Philosophical Language". You should read the whole of Borges' essay, but the relevant passage for computational taxonomists is this:

[N]otoriamente no hay clasificación del universo que no sea arbitraria y conjetural. La razón es muy simple: no sabemos qué cosa es el universo. "El mundo – escribe David Hume – es tal vez el bosquejo rudimentario de algún dios infantil, que lo abandonó a medio hacer, avergonzado de su ejecución deficiente; es obra de un dios subalterno, de quien los dioses superiores se burlan; es la confusa producción de una divinidad decrépita y jubilada, que ya se ha muerto" (Dialogues Concerning Natural Religion, V. 1779). Cabe ir más lejos; cabe sospechar que no hay universo en el sentido orgánico, unificador, que tiene esa ambiciosa palabra. Si lo hay, falta conjeturar su propósito; falta conjeturar las palabras, las definiciones, las etimologías, las sinonimias, del secreto diccionario de Dios.

[I]t is clear that there is no classification of the Universe that is not arbitrary and full of conjectures. The reason for this is very simple: we do not know what thing the universe is. "The world – David Hume writes – is perhaps the rudimentary sketch of a childish god, who left it half done, ashamed by his deficient work; it is created by a subordinate god, at whom the superior gods laugh; it is the confused production of a decrepit and retiring divinity, who has already died" ('Dialogues Concerning Natural Religion', V. 1779). We are allowed to go further; we can suspect that there is no universe in the organic, unifying sense, that this ambitious term has. If there is a universe, its aim is not conjectured yet; we have not yet conjectured the words, the definitions, the etymologies, the synonyms, from the secret dictionary of God.

Read the rest of this entry »

Interview with Charles Yang

Charles Yang* is perhaps best known for the development of the Tolerance Principle, a way to quantify and predict (given some input) whether a rule will become productive. He is currently Professor of Linguistics at the University of Pennsylvania, where he collaborates with various researchers around the world to test and extend the Tolerance Principle and gain greater insight into the mechanisms underlying language acquisition.
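As a rough illustration of what the Tolerance Principle quantifies: in its usual statement, a rule that could apply to N items is predicted to be productive only if the number of exceptions does not exceed N / ln N. The sketch below assumes that formulation; the function names are hypothetical and the numbers are not drawn from any study.

```python
import math


def tolerance_threshold(n):
    """Largest number of exceptions a rule over n items can tolerate: n / ln(n)."""
    if n < 2:
        raise ValueError("n must be at least 2 so that ln(n) is positive")
    return n / math.log(n)


def is_productive(n, exceptions):
    """True if the exception count is within the tolerance threshold."""
    return exceptions <= tolerance_threshold(n)


# Illustrative numbers only: a rule over 100 items.
print(round(tolerance_threshold(100), 2))
print(is_productive(100, 21), is_productive(100, 22))
```

With 100 items the threshold is about 21.7, so 21 exceptions are tolerated and 22 are not.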


How did you get into Computational Linguistics?

I’ve always been a computer scientist, I never really took any linguistics classes and I was interested in compilers. I was doing AI, so it was kind of natural to think about how human languages were parsed. I remember going to the library looking for stuff like this and I stumbled onto the book “Principle Based Parsing” which was an edited volume and it was incomprehensible. It was fascinating, actually, I wrote [Noam] Chomsky a physical letter way back in the day when I was a kid in Ohio and he kindly replied and said things like there’s recent work in syntax and so on. That was one of the reasons I applied to MIT to do computer science because I was attracted to the work of Bob Berwick who was the initiator of principle based parsing at the time. While doing that, I also ran across Mitch Marcus’s book. I don’t think I quite understood everything he was saying there but his idea of deriving syntactic constraints from parsing strategies was very interesting. I started reading Lectures on Government & Binding among other things. I applied to MIT, I got in. I had some marginal interests in vision, I was very attracted to Shimon Ullman’s work on the psychophysical constraints of vision. [It was] very much out of the Marrian program as opposed to what was beginning to become common, which was this image processing based approach to vision which was just applied data analysis which didn’t quite interest me as much.

Read the rest of this entry »

Purchase wine, buy beer

30 years ago, Don Hindle explored the idea of calculating semantic similarity on the basis of predicate-argument relations in text corpora, and in the context of that work, I remember him noting that we tend to purchase wine but buy beer. He didn't have a lot of evidence for that insight, since he was working with a mere six-million-word corpus of Associated Press news stories, in which the available counts were small:

            wine   beer
purchase       1      0
buy            0      3
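To put a number on how little evidence four observations provide, here is a minimal sketch of a one-sided Fisher exact test on that 2×2 table, using only the standard library. The choice of test is an assumption made for illustration; it is not a measure Hindle used.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(top-left cell >= a) under the hypergeometric null with fixed margins.

    The table is [[a, b], [c, d]]: rows are verbs, columns are nouns.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)
    upper = min(row1, col1)
    # Sum the probabilities of every table at least as extreme as the observed one.
    numer = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1))
    return numer / denom


# Rows: purchase, buy; columns: wine, beer (the counts from the table above).
p_value = fisher_one_sided(1, 0, 0, 3)
print(p_value)
```

The result is 0.25: a split this clean would turn up by chance a quarter of the time.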

So for today's lecture on semantics for ling001, I thought I'd check the counts in one of the larger collections available today, as an example of the weaker types of connotational meaning.

Read the rest of this entry »

18th-century RNA research

As I was looking into the history of the term biomarker, Google Scholar reminded me that automatic information extraction from text remains imperfect:

Google Scholar's translation into APA format:

Crea, F., Watahiki, A., & Quagliata, L. (1769). Identification of a long non-coding RNA as a novel biomarker and potential therapeutic target for metastatic prostate cancer. Oncotarget 5, 764–774.

Read the rest of this entry »

The computational linguistics of COVID-19 vaccine design

He Zhang, Liang Zhang, Ziyu Li, Kaibo Liu, Boxiang Liu, David H. Mathews, and Liang Huang, "LinearDesign: Efficient Algorithms for Optimized mRNA Sequence Design", arXiv.org 4/21/2020:

A messenger RNA (mRNA) vaccine has emerged as a promising direction to combat the current COVID-19 pandemic. This requires an mRNA sequence that is stable and highly productive in protein expression, features which have been shown to benefit from greater mRNA secondary structure folding stability and optimal codon usage. However, sequence design remains a hard problem due to the exponentially many synonymous mRNA sequences that encode the same protein. We show that this design problem can be reduced to a classical problem in formal language theory and computational linguistics that can be solved in O(n^3) time, where n is the mRNA sequence length. This algorithm could still be too slow for large n (e.g., n = 3,822 nucleotides for the spike protein of SARS-CoV-2), so we further developed a linear-time approximate version, LinearDesign, inspired by our recent work, LinearFold. This algorithm, LinearDesign, can compute the approximate minimum free energy mRNA sequence for this spike protein in just 11 minutes using beam size b = 1,000, with only 0.6% loss in free energy change compared to exact search (i.e., b = +infinity, which costs 1 hour). We also develop two algorithms for incorporating the codon optimality into the design, one based on k-best parsing to find alternative sequences and one directly incorporating codon optimality into the dynamic programming. Our work provides efficient computational tools to speed up and improve mRNA vaccine development.
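To see where "exponentially many" comes from, here is a small sketch that counts the synonymous mRNA sequences for a protein by multiplying the number of codons available for each residue under the standard genetic code. It illustrates only the size of the search space, not the LinearDesign algorithm; the function name and the test peptides are hypothetical.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (one-letter codes; "*" stands for a stop codon).
CODONS_PER_RESIDUE = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4, "*": 3,
}


def synonymous_count(protein):
    """Number of distinct mRNA sequences that encode the given protein."""
    return prod(CODONS_PER_RESIDUE[residue] for residue in protein)


# Hypothetical toy peptides, not taken from any real protein.
print(synonymous_count("MLS"))
print(synonymous_count("L" * 10))
```

Three residues already allow 36 encodings, and ten leucines allow 6^10 = 60,466,176; the count is multiplied by another factor at every position, which is why enumeration is hopeless for a protein of spike length.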

Read the rest of this entry »

The Scunthorpe effect rides again

Alex Hern, "Anti-porn filters stop Dominic Cummings trending on Twitter", The Guardian 5/27/2020:

Twitter’s anti-porn filters have blocked Dominic Cummings’ name from its list of trending topics despite Boris Johnson’s chief adviser dominating British political news for almost a week, the Guardian can reveal.

As a result of the filtering, trending topics over the past five days have instead included a variety of misspellings of his name, including #cummnings, #dominiccummigs and #sackcummimgs, as well as his first name on its own, the hashtag #sackdom, and the place names Durham, County Durham and Barnard Castle.

The filter also affects suggested hashtags, meaning users who tried to type #dominiccummings were instead presented with one of the misspelled variations to auto-complete, helping them trend instead.

This sort of accidental filtering has gained a name in computer science: the Scunthorpe problem, so-called because of the Lincolnshire town’s regular issues with such censorship.
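A minimal sketch of how this kind of false positive arises, assuming a filter that does plain substring matching against a blocklist. Twitter's actual filter is not described in the article, so the blocklist, the function names and the test strings here are all hypothetical.

```python
import re

# Hypothetical one-entry blocklist, for illustration only.
BLOCKLIST = ["sex"]


def naive_blocked(text):
    """Flag text if any blocklisted term occurs anywhere inside it."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def word_boundary_blocked(text):
    """Flag text only if a blocklisted term occurs as a whole word."""
    return any(
        re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE)
        for term in BLOCKLIST
    )


for place in ["Essex", "Sussex", "Middlesex"]:
    print(place, naive_blocked(place), word_boundary_blocked(place))
```

The substring filter flags all three place names, while the whole-word filter lets them through, at the cost of missing terms embedded in longer strings.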

Read the rest of this entry »

Mama Drama

So-called "verbal fluency" is one of the tasks we're using in the first iteration of the SpeechBiomarkers project (and please participate if you haven't done so!). Despite the test's name, it doesn't really measure verbal fluency, but rather asks subjects to name as many words as possible from some category in 60 seconds, like "animals" or "words starting with the letter F".

Here's the first ten seconds of one participant's "animals" response:

As you can hear, the audio quality is quite good, although it was recorded remotely using the participant's browser and their system's standard microphone. These days, standard hardware usually has pretty good built-in facilities for voice recording.

In order to automate the analysis, we need a speech-to-text system that can do a good enough job on data of this kind. As I've noted in earlier posts, we're not there yet for picture descriptions ("Shelties On Alki Story Forest", 11/26/2019; "The right boot of the warner of the baron", 12/6/2019). For "fluency" recordings, the error rate is worse — but maybe we're actually closer to a solution, as I'll explain below.
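For concreteness, one common way to score a speech-to-text system is word error rate: the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The sketch below assumes that metric, which may not be the one used in the analysis here, and the two word lists are made up rather than taken from the recordings.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Made-up example: one substitution and one deletion out of four words.
wer = word_error_rate("dog cat horse cow", "dog cat house")
print(wer)
```

Here the score is 0.5, since two of the four reference words are wrong or missing.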

Read the rest of this entry »

Chatbot comedy

Unfortunately, most customer service chatbots are not nearly this good:
