Archive for Computational linguistics

Literary moist aversion

Over the years, we've viewed the phenomenon of word aversion from several angles — a recent discussion, with links to earlier posts, can be found here. What we're calling word aversion is a feeling of intense, irrational distaste for the sound or sight of a particular word or phrase, not because its use is regarded as etymologically or logically or grammatically wrong, nor because it's felt to be over-used or redundant or trendy or non-standard, but simply because the word itself somehow feels unpleasant or even disgusting.

Some people react in this way to words whose offense seems to be entirely phonetic: cornucopia, hardscrabble, pugilist, wedge, whimsy. In other cases, it's plausible that some meaning-related associations play a role: creamy, panties, ointment, tweak. Overall, the commonest object of word aversion in English, judging from many discussions in web forums and comments sections, is moist.

One problem with web forums and comments sections as sources of evidence is that they don't tell us what fraction of the population experiences the phenomenon of word aversion, either in general or with respect to some particular word like moist. Dozens of commenters may join the discussion in a forum that has at most thousands of readers, but we can't tell whether they represent one person in five or one person in a hundred; nor do we know how representative of the general population a given forum or comments section is.

Pending other approaches, it occurred to me that we might be able to learn something from looking at usage in literary works. Authors who are squicked by moist, for example, will plausibly tend to find alternatives. (Well, in some cases the effect might motivate over-use; but never mind that for now…)

So for this morning's Breakfast Experiment™, I downloaded the April 2010 Project Gutenberg DVD, and took a quick look.
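The "quick look" amounts to computing how often a target word occurs per million running words in each author's texts. A minimal sketch of that kind of rate check (the sample sentence is illustrative, not from the Gutenberg data):

```python
import re

def word_rate(text, target="moist", per=1_000_000):
    """Occurrences of `target` per `per` running words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return per * sum(1 for w in words if w == target) / len(words)

sample = "The moist earth smelled of rain. Nothing else was moist."
rate = word_rate(sample)   # 2 hits in 10 words -> 200,000 per million
```

Comparing such rates across authors (against a baseline rate for the whole corpus) is one way to spot writers who seem to be avoiding the word.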

Read the rest of this entry »

Comments (27)

Translation as cryptography as translation

Warren Weaver, 1947 letter to Norbert Wiener, quoted in "Translation", 1949:

[K]nowing nothing official about, but having guessed and inferred considerable about, powerful new mechanized methods in cryptography – methods which I believe succeed even when one does not know what language has been coded – one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography.

Mark Brown, "Modern Algorithms Crack 18th Century Secret Code", Wired UK 10/26/2011:

Computer scientists from Sweden and the United States have applied modern-day, statistical translation techniques — the sort of which are used in Google Translate — to decode a 250-year-old secret message.

The original document, nicknamed the Copiale Cipher, was written in the late 18th century and found in the East Berlin Academy after the Cold War. It’s since been kept in a private collection, and the 105-page, slightly yellowed tome has withheld its secrets ever since.

But this year, University of Southern California Viterbi School of Engineering computer scientist Kevin Knight — an expert in translation, not so much in cryptography — and colleagues Beáta Megyesi and Christiane Schaefer of Uppsala University in Sweden tracked down the document, transcribed a machine-readable version and set to work cracking the centuries-old code.
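The Copiale Cipher is a homophonic substitution cipher, and Knight's team attacked it with statistical machine-translation machinery; that is well beyond a blog snippet, but the family resemblance to cryptanalysis starts with something as simple as symbol-frequency counting. A toy illustration on a monoalphabetic cipher (a Caesar shift, not the Copiale system):

```python
from collections import Counter

# Toy ciphertext: "the quick brown fox jumps over the lazy dog" shifted by 4.
ciphertext = "XLI UYMGO FVSAR JSB NYQTW SZIV XLI PEDC HSK"

def frequency_table(text):
    """Relative frequency of each letter -- the first clue in any substitution attack."""
    letters = [c for c in text if c.isalpha()]
    n = len(letters)
    return {c: count / n for c, count in Counter(letters).most_common()}

freqs = frequency_table(ciphertext)
```

Matching the most frequent cipher symbols against the expected frequencies of plaintext letters is the classical entry point; the statistical-translation approach generalizes this by modeling whole sequences rather than isolated symbols.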

Read the rest of this entry »

Comments (22)

In favor of the microlex

Bruce Schneier quotes Stubborn Mule citing R.A. Howard:

Shopping for coffee you would not ask for 0.00025 tons (unless you were naturally irritating), you would ask for 250 grams. In the same way, talking about a 1/125,000 or 0.000008 risk of death associated with a hang-gliding flight is rather awkward. With that in mind, Howard coined the term “microprobability” (μp) to refer to an event with a chance of 1 in 1 million, and a 1 in 1 million chance of death he calls a “micromort” (μmt). We can now describe the risk of hang-gliding as 8 micromorts, and you would have to drive around 3,000km in a car before accumulating a risk of 8μmt, which helps compare these two remote risks.

This reminds me of the Google Ngram Viewer's habit of citing word frequencies as percentages, with uninterpretably large numbers of leading zeros after the decimal point:
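The conversion in both cases is the same trivial rescaling: express a tiny probability or percentage in events per million, so the leading zeros disappear. A sketch (the Ngram-style percentage is an invented example value):

```python
def to_micro(p):
    """Express a probability as events per million (micro-units)."""
    return p * 1_000_000

# Hang-gliding risk from the quote: 1/125,000 per flight -> 8 micromorts.
hang_gliding = to_micro(1 / 125_000)

def percent_to_per_million(pct):
    """Turn an Ngram-Viewer-style percentage into occurrences per million words."""
    return pct / 100 * 1_000_000

# e.g. a word at 0.0000732% of all tokens is 0.732 per million words.
word_freq = percent_to_per_million(0.0000732)
```

A "microlex" unit of one occurrence per million words would do for word frequencies what the micromort does for risks.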

Read the rest of this entry »

Comments (27)

Speech-to-speech translation

Rick Rashid, "Microsoft Research shows a promising new breakthrough in speech translation technology", 11/8/2012:

A demonstration I gave in Tianjin, China at Microsoft Research Asia’s 21st Century Computing event has started to generate a bit of attention, and so I wanted to share a little background on the history of speech-to-speech technology and the advances we’re seeing today.

In the realm of natural user interfaces, the single most important one – yet also one of the most difficult for computers – is that of human speech.

Read the rest of this entry »

Comments (29)

Pundits were confused and inaccurate

Also, the sky turns out to have been blue much of the time, and early returns are strongly suggesting that water is often wet. John Sides, "2012 Was the Moneyball Election", The Monkey Cage 11/7/2012:

Barack Obama’s victory tonight is also a victory for the Moneyball approach to politics.  It shows us that we can use systematic data—economic data, polling data—to separate momentum from no-mentum, to dispense with the gaseous emanations of pundits’ “guts,” and ultimately to forecast the winner.

Read the rest of this entry »

Comments (25)

The he's and she's of Twitter

My latest column for the Boston Globe is about some fascinating new research presented by Tyler Schnoebelen at the recent NWAV 41 conference at Indiana University Bloomington. Schnoebelen's paper, co-authored with Jacob Eisenstein and David Bamman, is entitled "Gender, styles, and social networks in Twitter" (abstract, full paper, presentation).

Read the rest of this entry »

Comments (6)

'lololololol' ≠ Tagalog

Ed Manley, "Detecting Languages in London's Twittersphere", UrbanMovements 10/22/2012:

Over the last couple of weeks, and as a bit of a distraction from finishing off my PhD, I've been working with James Cheshire looking at the use of different languages within my aforementioned dataset of London tweets.

I've been handling the data generation side, and the method really is quite simple.  Just like some similar work carried out by Eric Fischer, I've employed the Chromium Compact Language Detector – an open-source Python library adapted from the Google Chrome algorithm to detect a website's language – in detecting the predominant language contained within around 3.3 million geolocated tweets, captured in London over the course of this summer. […]

One issue with this approach that I did note was the surprising popularity of Tagalog, a language of the Philippines, which initially was identified as the 7th most tweeted language.  On further investigation, I found that many of these classifications included just uses of English terms such as 'hahahahaha', 'ahhhhhhh' and 'lololololol'.  I don't know much about Tagalog but it sounds like a fun language.  Nevertheless, Tagalog was excluded from our analysis.
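The misclassified tokens all share a structure — a short motif repeated many times — so one plausible pre-filter (not what Manley actually did; he simply excluded Tagalog) is to screen out laughter-like tokens before handing tweets to the detector. A stdlib-only sketch:

```python
import re

# Tokens like 'hahahahaha', 'lololololol' and 'ahhhhhhh' are a short motif
# (1-3 characters) repeated three or more times, possibly with a little
# leading or trailing debris. Flag them before language detection.
REPEAT = re.compile(r"^\w{0,3}?(\w{1,3})\1{2,}\w{0,3}$", re.IGNORECASE)

def looks_like_laughter(token):
    """True for repetitive interjection-style tokens, False for ordinary words."""
    return bool(REPEAT.match(token))
```

Dropping such tokens (or whole tweets dominated by them) before classification would remove much of the spurious "Tagalog" without excluding the language outright.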

Read the rest of this entry »

Comments (10)

Nurbling

Comments (30)

A new chapter for Google Ngrams

When Google's Ngram Viewer was launched in December 2010 it encouraged everyone to be an amateur computational linguist, an amateur historical lexicographer, or a little of both. Today, the public interface that allows users to plumb the Google Books megacorpus has been relaunched, and the new version makes it even more enticing to researchers, both scholarly and nonscholarly. You can read all about it in my online piece for The Atlantic, as well as Jon Orwant's official introduction on the Google Research blog.

Read the rest of this entry »

Comments (13)

"… repeated violations of an act"

Brian Mahoney, "NBA Sets Flopping Penalties; Players May Be Fined", AP 10/3/2012:

Stop the flop.

The NBA will penalize floppers this season, fining players for repeated violations of an act a league official said Wednesday has "no place in our game."

Those exaggerated falls to the floor may fool the referees and fans during the game, but officials at league headquarters plan to take a look for themselves afterward.

Read the rest of this entry »

Comments (21)

Lexical loops

David Levary, Jean-Pierre Eckmann, Elisha Moses, and Tsvi Tlusty, "Loops and Self-Reference in the Construction of Dictionaries", Phys. Rev. X 2, 031018 (2012):

ABSTRACT: Dictionaries link a given word to a set of alternative words (the definition) which in turn point to further descendants. Iterating through definitions in this way, one typically finds that definitions loop back upon themselves. We demonstrate that such definitional loops are created in order to introduce new concepts into a language. In contrast to the expectations for a random lexical network, in graphs of the dictionary, meaningful loops are quite short, although they are often linked to form larger, strongly connected components. These components are found to represent distinct semantic ideas. This observation can be quantified by a singular value decomposition, which uncovers a set of conceptual relationships arising in the global structure of the dictionary. Finally, we use etymological data to show that elements of loops tend to be added to the English lexicon simultaneously and incorporate our results into a simple model for language evolution that falls within the “rich-get-richer” class of network growth.
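In graph terms, a dictionary is a directed graph from each headword to the words in its definition, and the "loops" and "strongly connected components" of the abstract are exactly the SCCs of that graph. A toy sketch, with an invented six-word dictionary and Kosaraju's algorithm (the paper's actual lexical network is of course vastly larger):

```python
from collections import defaultdict

# Toy dictionary graph: each word points to the words used to define it.
defs = {
    "big":   ["large"],
    "large": ["great"],
    "great": ["big"],        # definitional loop: big -> large -> great -> big
    "red":   ["color"],
    "color": ["hue"],
    "hue":   ["color"],      # second loop: color <-> hue
}

def sccs(graph):
    """Kosaraju's algorithm: strongly connected components of a directed graph."""
    order, seen = [], set()
    def dfs1(u):                       # first pass: record finish order
        seen.add(u)
        for v in graph.get(u, []):
            if v not in seen:
                dfs1(v)
        order.append(u)
    for u in graph:
        if u not in seen:
            dfs1(u)
    rev = defaultdict(list)            # reverse all edges
    for u in graph:
        for v in graph.get(u, []):
            rev[v].append(u)
    comps, assigned = [], set()
    def dfs2(u, comp):                 # second pass: sweep the reversed graph
        assigned.add(u)
        comp.append(u)
        for v in rev[u]:
            if v not in assigned:
                dfs2(v, comp)
    for u in reversed(order):
        if u not in assigned:
            comp = []
            dfs2(u, comp)
            comps.append(comp)
    return comps

loops = [c for c in sccs(defs) if len(c) > 1]   # the definitional loops
```

Here the nontrivial components recover the two planted loops; in the real data, the paper's finding is that such loops are short and cluster into larger semantically coherent components.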

Read the rest of this entry »

Comments (22)

Historical culturomics of pronoun frequencies

Jean M. Twenge, W. Keith Campbell and Brittany Gentile, "Male and Female Pronoun Use in U.S. Books Reflects Women’s Status, 1900–2008", Sex Roles published online 8/7/2012. The abstract:

The status of women in the United States varied considerably during the 20th century, with increases 1900–1945, decreases 1946–1967, and considerable increases after 1968. We examined whether changes in written language, especially the ratio of male to female pronouns, reflected these trends in status in the full text of nearly 1.2 million U.S. books 1900–2008 from the Google Books database. Male pronouns included he, him, his, himself and female pronouns included she, her, hers, and herself. Between 1900 and 1945, 3.5 male pronouns appeared for every female pronoun, increasing to 4.5 male pronouns during the postwar era of the 1950s and early 1960s. After 1968, the ratio dropped precipitously, reaching 2 male pronouns per female pronoun by the 2000s. From 1968 to 2008, the use of male pronouns decreased as female pronouns increased. The gender pronoun ratio was significantly correlated with indicators of U.S. women’s status such as educational attainment, labor force participation, and age at first marriage as well as women’s assertiveness, a personality trait linked to status. Books used relatively more female pronouns when women’s status was high and fewer when it was low. The results suggest that cultural products such as books mirror U.S. women’s status and changing trends in gender equality over the generations.
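The measure behind all those numbers is just a count of two closed word classes. A minimal sketch of the ratio computation on a made-up sample sentence:

```python
import re

MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def pronoun_ratio(text):
    """Male-to-female pronoun ratio, in the spirit of the Twenge et al. measure."""
    words = re.findall(r"[a-z]+", text.lower())
    m = sum(w in MALE for w in words)
    f = sum(w in FEMALE for w in words)
    return m / f if f else float("inf")

sample = "He said his plan was ready. She told him she agreed with her sister."
ratio = pronoun_ratio(sample)   # 3 male, 3 female -> 1.0
```

Note that "her" is ambiguous between object and possessive, and "his" between determiner and nominal, so the raw counts conflate several grammatical roles — one of several reasons to treat the trend lines with care.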

Read the rest of this entry »

Comments (20)

Noisily channeling Claude Shannon

There's a passage in James Gleick's "Auto Crrect Ths!", NYT 8/4/2012, that's properly spelled but in need of some content correction:

If you type “kofee” into a search box, Google would like to save a few milliseconds by guessing whether you’ve misspelled the caffeinated beverage or the former United Nations secretary-general. It uses a probabilistic algorithm with roots in work done at AT&T Bell Laboratories in the early 1990s. The probabilities are based on a “noisy channel” model, a fundamental concept of information theory. The model envisions a message source — an idealized user with clear intentions — passing through a noisy channel that introduces typos by omitting letters, reversing letters or inserting letters.

“We’re trying to find the most likely intended word, given the word that we see,” Mr. [Mark] Paskin says. “Coffee” is a fairly common word, so with the vast corpus of text the algorithm can assign it a far higher probability than “Kofi.” On the other hand, the data show that spelling “coffee” with a K is a relatively low-probability error. The algorithm combines these probabilities. It also learns from experience and gathers further clues from the context.

The same probabilistic model is powering advances in translation and speech recognition, comparable problems in artificial intelligence. In a way, to achieve anything like perfection in one of these areas would mean solving them all; it would require a complete model of human language. But perfection will surely be impossible. We’re individuals. We’re fickle; we make up words and acronyms on the fly, and sometimes we scarcely even know what we’re trying to say.
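The combination Paskin describes — a language model for how likely each candidate word is, times a channel model for how likely the observed typo is given that word — can be sketched in a few lines. The probabilities below are invented for illustration; a real system estimates both distributions from data:

```python
# Toy noisy-channel corrector: pick the candidate w maximizing P(w) * P(typo | w).

prior = {"coffee": 1e-4, "kofi": 1e-7}        # language model: how common is the word?
error = {("kofee", "coffee"): 1e-3,           # channel model: probability of this typo
         ("kofee", "kofi"):   1e-2}           # given the intended word

def correct(typo, candidates):
    """Return the candidate with the highest posterior score for the observed typo."""
    return max(candidates, key=lambda w: prior[w] * error[(typo, w)])

best = correct("kofee", ["coffee", "kofi"])
```

With these numbers "coffee" wins (1e-7 against 1e-9) even though "kofee" is closer in spelling to "Kofi" — the prior dominates, which is exactly the point of the quoted passage.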

Read the rest of this entry »

Comments (7)