The computational linguistics of COVID-19 vaccine design


He Zhang, Liang Zhang, Ziyu Li, Kaibo Liu, Boxiang Liu, David H. Mathews, and Liang Huang, "LinearDesign: Efficient Algorithms for Optimized mRNA Sequence Design", arXiv.org 4/21/2020:

A messenger RNA (mRNA) vaccine has emerged as a promising direction to combat the current COVID-19 pandemic. This requires an mRNA sequence that is stable and highly productive in protein expression, features which have been shown to benefit from greater mRNA secondary structure folding stability and optimal codon usage. However, sequence design remains a hard problem due to the exponentially many synonymous mRNA sequences that encode the same protein. We show that this design problem can be reduced to a classical problem in formal language theory and computational linguistics that can be solved in O(n^3) time, where n is the mRNA sequence length. This algorithm could still be too slow for large n (e.g., n = 3,822 nucleotides for the spike protein of SARS-CoV-2), so we further developed a linear-time approximate version, LinearDesign, inspired by our recent work, LinearFold. This algorithm, LinearDesign, can compute the approximate minimum free energy mRNA sequence for this spike protein in just 11 minutes using beam size b = 1,000, with only 0.6% loss in free energy change compared to exact search (i.e., b = +infinity, which costs 1 hour). We also develop two algorithms for incorporating the codon optimality into the design, one based on k-best parsing to find alternative sequences and one directly incorporating codon optimality into the dynamic programming. Our work provides efficient computational tools to speed up and improve mRNA vaccine development.
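The "exponentially many synonymous mRNA sequences" are easy to see concretely. Here's an illustrative brute-force sketch in Python for a toy peptide — emphatically not the authors' method, which represents this set compactly as a lattice (a DFA) and intersects it with the folding grammar rather than enumerating anything:

```python
from itertools import product

# A fragment of the standard genetic code (RNA alphabet).
CODONS = {
    "M": ["AUG"],                                     # methionine (start)
    "S": ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"],  # serine: 6 codons
    "L": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],  # leucine: 6 codons
    "R": ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],  # arginine: 6 codons
}

def synonymous_mrnas(peptide):
    """Enumerate every mRNA sequence that encodes the given peptide."""
    for choice in product(*(CODONS[aa] for aa in peptide)):
        yield "".join(choice)

seqs = list(synonymous_mrnas("MSLR"))
print(len(seqs))  # 1 * 6 * 6 * 6 = 216 candidates for a 4-residue peptide
```

For the full spike protein the product of codon degeneracies is far too large to enumerate, which is why the reduction to lattice parsing matters.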

This is another chapter in the story of Aravind Joshi. I sketched the background in an obituary post "R.I.P. Aravind Joshi", 1/1/2018:

The Rumelhart Prize award page gives an excellent summary of his contributions up to 2003 — Google Scholar will give you a list of the hundreds of publications he has co-authored since that time.

One of those more recent publications — David Chiang, Aravind K. Joshi, and David B. Searls,  "Grammatical representations of macromolecular structure", Journal of Computational Biology 2006 — reminds me of a favorite Aravind Joshi memory. (Though it's actually about something that happened before the Rumelhart Prize award.)

Together with David Searls, a former Penn colleague who was then Senior Vice President for Computational Biology at GlaxoSmithKline, Aravind had organized a workshop on the mathematical analysis of two types of strings of discrete elements: sentences and macromolecules.

One of the presentations was by Yasuo Uemura, Aki Hasegawa, and Satoshi Kobayashi, who presented work published as "Tree adjoining grammars for RNA structure prediction", Theoretical computer science 1999, which was "concerned with identifying a subclass of tree adjoining grammars (TAGs) that is suitable for the application to modeling and predicting RNA secondary structures". They explain their motivation for choosing the TAG formalism:

TAGs have attracted great attention in linguistics for the reason that they have the ability to represent certain discontinuous constituents. For example, cross-serial dependencies in a string such as a₁a₂a₃b₁b₂b₃, where indices represent dependencies, are easily captured by a TAG, but not expressed by any context-free grammar. For another example, a string such as xuyvx^Rwy^R, where x, y, u, v and w are strings and x^R (y^R) denotes the reversal of x (y), can also be readily expressed by a derived tree by some TAG. The structural feature of the latter example is called crossing dependency and as we will see soon, this extra power of TAGs to describe the crossing dependency seems to be just the right kind for a certain practical application of modeling and predicting biological sequence data.
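Crossing dependency is exactly what distinguishes a pseudoknot from an ordinary nested (context-free-describable) secondary structure. A minimal illustration — my example, not from the paper: two base pairs (i, j) and (k, l) cross when i < k < j < l.

```python
def crossing_pairs(pairs):
    """Return all pairs of base pairs (i, j), (k, l) with i < k < j < l.
    A nested structure (expressible by a context-free grammar) has none;
    a pseudoknot has at least one crossing."""
    ps = sorted(pairs)
    crossings = []
    for a in range(len(ps)):
        for b in range(a + 1, len(ps)):
            (i, j), (k, l) = ps[a], ps[b]
            if i < k < j < l:
                crossings.append((ps[a], ps[b]))
    return crossings

print(crossing_pairs([(0, 5), (1, 4)]))  # nested hairpin: []
print(crossing_pairs([(0, 4), (2, 6)]))  # pseudoknot: [((0, 4), (2, 6))]
```

Capturing those crossings is the "extra power" that motivates using TAGs rather than context-free grammars for pseudoknotted structures.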

Another presentation was by Elena Rivas and Sean Eddy, who described work published as "A dynamic programming algorithm for RNA structure prediction including pseudoknots", Journal of Molecular Biology 1999. From that paper's abstract:

We describe a dynamic programming algorithm for predicting optimal RNA secondary structure, including pseudoknots. […] The description of the algorithm is complex, which led us to adopt a useful graphical representation (Feynman diagrams) borrowed from quantum field theory.

After Elena's talk, Aravind and Elena spent the coffee break and some time thereafter at the whiteboard, exploring in detail the nature of the connections between Feynman Diagrams and Tree Adjoining Grammars. (Crudely, as I understand it, Feynman Diagrams offer an efficient solution for a certain class of integrals where most of a problematically increasing number of terms cancel as positive and negative pairs; these canceling pairs are analogous to the matching left and right brackets — or lexical dependencies — of a parsed sentence; both formalisms offer an analogy to the matching elements in macromolecule secondary structure.)

Liang Huang was Aravind's last student, and he wrote this in a comment on that obituary post:

I am deeply saddened by this news. Aravind brought me to Penn and I was his last PhD student. Both of us went to COLING 2002 and there he was recruiting students/postdocs for his new computational biology grant. He introduced me to the unexpected connections between grammars/parsing and structural biology, which was one of the "new" fields he delved into during his last 10 years of work (heavily featured in his Lifetime Achievement Award address, 2002). David, Julia, and I worked on this project, in collaboration with Ken Dill. However, much to his disappointment, I did not develop this line of work for my thesis; instead, I only narrowly focused on CL/NLP. Now after 10+ years, surprisingly, I'm back at the intersection of parsing algorithms and structural biology, exactly what Aravind wanted me to work on back in 2003! And by now I'm fully convinced that this direction is potentially more important than CL/NLP itself. Now it's my job to convince my own PhD students about this (which remains difficult). All I can tell is: Aravind was correct at the very beginning.

Several months ago at EMNLP 2017, Martha suggested that I give Aravind a call, esp. now that I finally had a breakthrough result in the direction he had me work on when I first started. I am so sad that I hadn't done that; Aravind would have been very happy hearing my result.

Mark, thanks for featuring so much about computational biology here. The Uemura et al and Rivas/Eddy papers you mentioned are now back at the center stage in my group; coincidentally, they were the very first two papers Aravind had me read before I started my PhD. I was deeply amazed by the connection back in 2002.

Aravind, I will continue this great direction in your honor.

Some background to the LinearDesign paper I started with:

Liang Huang, He Zhang, Dezhong Deng, Kai Zhao, Kaibo Liu, David A. Hendrix, and David H. Mathews, "LinearFold: linear-time approximate RNA folding by 5'-to-3' dynamic programming and beam search", arXiv.org 12/22/2019:

Motivation: Predicting the secondary structure of an RNA sequence is useful in many applications. Existing algorithms (based on dynamic programming) suffer from a major limitation: their runtimes scale cubically with the RNA length, and this slowness limits their use in genome-wide applications.
Results: We present a novel alternative O(n^3)-time dynamic programming algorithm for RNA folding that is amenable to heuristics that make it run in O(n) time and O(n) space, while producing a high-quality approximation to the optimal solution. Inspired by incremental parsing for context-free grammars in computational linguistics, our alternative dynamic programming algorithm scans the sequence in a left-to-right (5'-to-3') direction rather than in a bottom-up fashion, which allows us to employ the effective beam pruning heuristic. Our work, though inexact, is the first RNA folding algorithm to achieve linear runtime (and linear space) without imposing constraints on the output structure.
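To make the left-to-right idea concrete, here's a heavily simplified sketch in Python. It substitutes the Nussinov objective (maximize the number of base pairs, no minimum hairpin size) for the paper's thermodynamic model, and uses a plain top-b cutoff, so it illustrates the shape of the algorithm rather than reproducing LinearFold itself:

```python
def linear_nussinov(s, beam_size=100):
    """Left-to-right base-pair maximization with beam pruning.
    beam[j] maps a start position i to the best pair count for s[i:j]."""
    pairs = {("A", "U"), ("U", "A"), ("C", "G"),
             ("G", "C"), ("G", "U"), ("U", "G")}
    n = len(s)
    beam = [dict() for _ in range(n + 1)]
    beam[0] = {0: 0}                       # empty span at the start
    for j in range(1, n + 1):
        cand = {}
        prev = dict(beam[j - 1])
        prev.setdefault(j - 1, 0)          # seed the empty span [j-1, j-1]
        for i, v in prev.items():
            # leave s[j-1] unpaired: span [i, j] keeps score v
            if v > cand.get(i, -1):
                cand[i] = v
            # pair s[i-1] with s[j-1], enclosing s[i:j-1]
            if i >= 1 and (s[i - 1], s[j - 1]) in pairs:
                if v + 1 > cand.get(i - 1, -1):
                    cand[i - 1] = v + 1
        # combine: prepend a finished segment s[k:i] from an earlier beam
        for i, v in list(cand.items()):
            for k, u in beam[i].items():
                if u + v > cand.get(k, -1):
                    cand[k] = u + v
        # beam pruning: keep only the top-b start positions at step j
        beam[j] = dict(sorted(cand.items(), key=lambda kv: -kv[1])[:beam_size])
    return beam[n].get(0, 0)

print(linear_nussinov("GGGAAACCC"))  # 3 nested G-C pairs
```

With an unbounded beam this recovers the exact Nussinov optimum; with a fixed beam size b, each of the n steps does bounded work, which is the sense in which the pruned scan runs in time linear in n.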

He Zhang, Liang Zhang, David H. Mathews, and Liang Huang, "LinearPartition: Linear-Time Approximation of RNA Folding Partition Function and Base Pairing Probabilities", arXiv.org 12/31/2019:

RNA secondary structure prediction is widely used to understand RNA function. Recently, there has been a shift away from the classical minimum free energy (MFE) methods to partition function-based methods that account for folding ensembles and can therefore estimate structure and base pair probabilities. However, the classical partition function algorithm scales cubically with sequence length, and is therefore a slow calculation for long sequences. This slowness is even more severe than cubic-time MFE-based methods due to a larger constant factor in runtime. Inspired by the success of our recently proposed LinearFold algorithm that predicts the approximate MFE structure in linear time, we design a similar linear-time heuristic algorithm, LinearPartition, to approximate the partition function and base pairing probabilities, which is shown to be orders of magnitude faster than Vienna RNAfold and CONTRAfold (e.g., 2.5 days vs. 1.3 minutes on a sequence with length 32,753 nt). More interestingly, the resulting base pairing probabilities are even better correlated with the ground truth structures. LinearPartition also leads to a small accuracy improvement when used for downstream structure prediction on families with the longest length sequences (16S and 23S rRNA), as well as a substantial improvement on long-distance base pairs (500+ nt apart).
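As a toy picture of what a partition-function recursion looks like — my simplification, not the LinearPartition algorithm: if every structure is given equal Boltzmann weight, the partition function reduces to counting nested structures, and the classical cubic-time inside recursion is just a sum-product version of the folding DP.

```python
from functools import lru_cache

def partition_count(s):
    """Count all nested secondary structures of RNA s
    (equal-weight partition function; no minimum hairpin size)."""
    pairs = {("A", "U"), ("U", "A"), ("C", "G"),
             ("G", "C"), ("G", "U"), ("U", "G")}
    n = len(s)

    @lru_cache(maxsize=None)
    def Q(i, j):                      # structures of the subsequence s[i:j]
        if j - i <= 0:
            return 1                  # empty span: one (empty) structure
        total = Q(i, j - 1)           # s[j-1] unpaired
        for k in range(i, j - 1):     # s[j-1] paired with s[k]
            if (s[k], s[j - 1]) in pairs:
                total += Q(i, k) * Q(k + 1, j - 1)
        return total

    return Q(0, n)

print(partition_count("ACGU"))  # 5: {}, {AU}, {CG}, {AU,CG}, {GU}
```

Replacing the counts with exp(-E/RT) weights gives the real partition function; the slow cubic loop over split points k is what LinearPartition's left-to-right beam search avoids.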

The LinearDesign system has been used in vaccine and antibody development projects — details will be provided in another post, but for now, here's a link to the abstract of a 5/21/2020 talk.



1 Comment

  1. D.O. said,

    July 30, 2020 @ 10:36 pm

    Crudely, as I understand it, Feynman Diagrams offer an efficient solution for a certain class of integrals where most of a problematically increasing number of terms cancel as positive and negative pairs

    Not really. Feynman diagrams are a convenient graphical way to represent the rules of combining various interaction terms in quantum field theory. They are notorious for producing a factorially increasing number of terms (nothing wrong with the diagrams themselves, that's the nature of the perturbative quantum field theory) lots of which must (partially) cancel in order for the theory to produce meaningful results. This cancellation is by no means automatic. If you discover a way to never even see the terms that are about to cancel, you can rent a tuxedo for the trip to Stockholm.
