Language Log

Recommended For You

August 16, 2015 @ 8:00 am · Filed by Mark Liberman under Computational linguistics

Alexander Spangher, "Building the Next New York Times Recommendation Engine", NYT 8/11/2015:

The New York Times publishes over 300 articles, blog posts and interactive stories a day.

Refining the path our readers take through this content — personalizing the placement of articles on our apps and website — can help readers find information relevant to them, such as the right news at the right times, personalized supplements to major events and stories in their preferred multimedia format.

Spangher describes "Content-Based Filtering", which depends on the distribution of words and word-sequences in the articles you've previously read; and "Collaborative Filtering", which looks at the articles read by other readers who have read some of the same articles that you have. He notes problems with each approach, leading to their new algorithm,

. . . inspired by a technique, Collaborative Topic Modeling (CTM), that (1) models content, (2) adjusts this model by viewing signals from readers, (3) models reader preference and (4) makes recommendations by similarity between preference and content.
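For readers who want something concrete, here is a minimal sketch of the simpler "content-based" idea mentioned above: score unread articles by how closely their word distribution matches what a reader has already read. It assumes plain word counts and cosine similarity, and the function names and toy texts are hypothetical; it is not the Times' method.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a real tokenizer."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_by_content(read_texts, candidates):
    """Rank candidate articles by word overlap with the reader's history."""
    profile = Counter()
    for text in read_texts:
        profile.update(bag_of_words(text))
    scored = [(cosine(profile, bag_of_words(text)), title)
              for title, text in candidates.items()]
    return [title for _, title in sorted(scored, reverse=True)]

# Hypothetical toy data, invented for illustration only.
history = ["senate votes on budget bill", "budget talks stall in senate"]
candidates = {
    "politics": "senate passes budget after long talks",
    "sports": "home team wins late in extra innings",
}
ranking = recommend_by_content(history, candidates)
```

The politics piece ranks first here because it shares "senate", "budget" and "talks" with the reading history, while the sports piece shares only "in".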

He links to the paper that inspired them (Chong Wang and David Blei, "Collaborative Topic Modeling for Recommending Scientific Articles", KDD 2011), and discusses how they've met a "three-part challenge":

Part 1: How to model an article based on its text.
Part 2: How to update the model based on audience reading patterns.
Part 3: How to describe readers based on their reading history.

The solution, in brief, is to use Latent Dirichlet Allocation to place articles in a low-dimensional topic space; to use the Collaborative Topic Modeling method to iteratively adjust article placement based on the apparent topic interests of each article's readers; and to use a weighted average of the topic-space position of articles read as "a quick way to calculate reader preferences".
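The last step can be sketched in a few lines. The example below assumes that each article already has a topic-space position (the vectors here are made up by hand, not produced by Latent Dirichlet Allocation), that the weights are some per-article signal such as recency (the source doesn't say what they are), and that cosine similarity is the measure of closeness. All names are hypothetical.

```python
import math

def weighted_average(vectors, weights):
    """Weighted mean of equal-length vectors."""
    total = sum(weights)
    dims = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(dims)]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical topic proportions over (politics, sports, arts).
article_topics = {
    "budget-vote": [0.8, 0.1, 0.1],
    "playoff-recap": [0.1, 0.8, 0.1],
    "gallery-review": [0.1, 0.1, 0.8],
    "stadium-funding": [0.5, 0.4, 0.1],
}

read = ["budget-vote", "playoff-recap"]
weights = [3.0, 1.0]  # assumed weighting; heavier on the first article

# Reader preference: a point in the same topic space as the articles.
preference = weighted_average([article_topics[a] for a in read], weights)

# Recommend unread articles closest to that point.
unread = [a for a in article_topics if a not in read]
ranked = sorted(unread,
                key=lambda a: cosine(preference, article_topics[a]),
                reverse=True)
```

With these numbers the reader lands at roughly (0.625, 0.275, 0.1), so the mixed politics-and-sports article ranks ahead of the gallery review.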

If you're interested in this sort of thing, read all of Spangher's piece, and the 2011 CTM article, and perhaps some of the 272 articles that Google Scholar lists as citing the CTM article.

But whether you're interested in the details or not, you should take note of an increasingly important kind of technology that doesn't have a name, as far as I know. It has emerged from 50 years of research and 20 years of increasingly broad application.

These techniques apply to collections of texts that are associated with a number of other features: in the current example it's articles and readers (and maybe dates and places and authors?); it might be web pages with their domain information and link graph; it might be a bibliometric network of authors, affiliations, journals, publishers, articles; it might be a network of Twitter authors, times, places, hashtags; or product reviews along with star ratings, author IDs and product descriptions; or the text of open-ended survey responses along with multiple-choice outcomes and subject demographics; or Facebook posts with authors' demographic information and personality-test results; or collections of real-estate listings with locations and prices and sales information; or job listings with information about applicants and outcomes; or . . .

Recommendation systems are just one of many applications. The problems to be solved range from easy to impossible, and the algorithms used range from simple to complex, and from obvious to subtle and surprising. (Sometimes the most subtle and surprising methods are also the simplest…)

There are obvious (and existing) applications in commerce, in medicine, in sociology, in law, in education, in literary studies — given the increasing digitization of communication, it's hard to think of any domain where this kind of technology is not already applied or soon to be applied. Most large companies have at least dipped their toes into this area, and some of them have plunged in enthusiastically. And new companies are springing up like the proverbial mushrooms after a rain.

There are obviously close connections to non-textual problems. The "collaborative filtering" method is content-neutral, so that a music recommendation system using this technique is basically identical to an article recommendation system — but as Spangher observes, there are good reasons to add content-based information to systems based purely on preference networks. For many other applications, combining content analysis with other dimensions of information is essential. And in a large range of cases, the most accessible and useful source of content is text.
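To make "content-neutral" concrete, here is a minimal sketch of one simple form of collaborative filtering. It assumes the only data available are sets of opaque item IDs per user, and it scores an unseen item by the overlap between the target user's history and the histories of users who consumed that item. The function name and toy data are hypothetical; nothing in the code knows whether the IDs stand for songs or articles.

```python
from collections import defaultdict

def recommend_collaborative(histories, user, top_n=3):
    """Rank items the user has not seen, using only co-consumption.

    Each other user votes for their items with a weight equal to the
    number of items they share with `user`.
    """
    seen = histories[user]
    scores = defaultdict(int)
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(seen & items)
        if overlap == 0:
            continue  # no shared history, so no evidence from this user
        for item in items - seen:
            scores[item] += overlap
    # Highest score first; ties broken by item ID for determinism.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:top_n]

# Hypothetical histories: the IDs could be article slugs or track IDs.
histories = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "d"},
    "u4": {"e"},
}
recs = recommend_collaborative(histories, "u1")
```

Item "e" is never recommended to "u1" because no one with a shared history consumed it. An item that nobody has consumed yet gets no score at all, which is one reason to add content-based information.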

Given all of this, it's odd that the technology we're talking about doesn't have a name.

6 Comments

David Q said,

August 16, 2015 @ 9:43 am

I would like to understand the thing-with-no-name, but to me this post reads more like a guessing game with a collection of hints than a definition awaiting a label.

Perhaps that's part of the difficulty of describing something that you don't have a name for. But what I see here is a collection of assertions about the broad scope of "the technology" without a description of its particulars or its limits. Are content-based filtering and collaborative filtering examples of "the technology"?

Given the clues available, I'd suggest that the author consider "information retrieval", and if that proves too broad, "document modeling".

Viseguy said,

August 16, 2015 @ 7:37 pm

Multiplexting?

Rubrick said,

August 17, 2015 @ 2:06 am

"It's hard to think of any domain where this kind of technology is not already applied or soon to be applied."

Face painting, whistling, and waterbed repair spring to mind.

[(myl) Fair enough. Make that "… (potentially) networked digital domain …"]

Jason Stokes said,

August 17, 2015 @ 7:52 am

This may have been what Ted Nelson was actually getting at with his "Xanadu" vision/framework/technology/mental masturbation vaporware project. Xanadu is usually described as merely "hypertext", fulfilled by the World Wide Web, but if you read Nelson's ever more florid panegyrics to his original vision, mere hypertext was not true Xanadu; it also contained elements of digital rights management, user-driven metatext and interlinking in what we'd probably now call vast recommender or information retrieval systems.

_NL said,

August 17, 2015 @ 12:22 pm

Rubrick: I wouldn't be shocked if home demographics (number of young children likely to be in the house), shopping behaviors (frequent purchases of cleaning products and paper towels) and warranty preferences (buying the spill protection and drop protection for furniture and electronics) made waterbed repair susceptible to the process that shall not be named.

Years ago, it was revealed that Target stores sent out flyers based on the presumed shopping preferences of customers. The algorithm was good enough to predict the stage of pregnancy, and in one case the young woman in the household had not informed her father of the pregnancy. Now Target still does the focused selling, but they mix in enough irrelevant filler products to the ad that you feel like they aren't rummaging through every one of your receipts.

I nominate the term "preference mining." It might be applicable to any business or industry that might want to reach a relatively disparate number of customers who behave as though they are anonymous. It might also have applications to customers who are not anonymous or numerous, like in relationship-driven professional service industries, where more effective schmoozing of a smaller number of clients could be seen as advantageous.

You might even see online ads, coupons and spam-email efforts focused on advertising face-painting services and whistling lessons to interested parties. This technology is really about gathering data on multiple people without needing to interview them, then predicting choices they might make. So a particularly ambitious (megalomaniacal?) algorithm might easily be used in an attempt to predict elections, foreign policy, outbreaks of war or even susceptibility to disease or illness.

Billrr said,

August 17, 2015 @ 12:25 pm

What I'd like to know is how to opt out of this filtering. Why should I trust their algorithm to tell me what I can or cannot see, and thereby limit my choice of what to read? Given inherent limits of screen size, etc., if they're promoting something they predict I'll like to read, implicitly something else I might like to read but that is totally unrelated to anything they've seen me read will be hidden from me. How can that be a Good Thing?
