## More on the statistics of real-estate listings

Early last summer, an inquiry from Sanette Tanaka at the WSJ led me to do a Breakfast Experiment™ on the relationship between the language of real-estate listings and the price of the associated properties ("Long is good, good is bad, nice is worse, and ! is questionable", 6/12/2013; "Significant (?) relationships everywhere", 6/14/2013; "City of the big disjunctions", 6/20/2013).

Since then, Bob Stine and Dean Foster (Wharton Statistics) and I have done a more serious investigation in this area. Bob has put a draft paper up on his web site (Foster, Liberman, and Stine, "Featurizing text: Converting text into predictors for regression analysis") along with a set of slides. There's also a video of Bob giving a talk at CUNY last month about this work.

I don't have time this morning for a longer explanation (it's morning in Paris, where I am at the moment), but here's how Bob headlines the paper:

This draft manuscript (really more of a working paper) describes fast methods for the construction of numerical regressors from text using spectral methods related to the singular value decomposition (SVD). An example uses these methods to build regression models for the price of Chicago real estate using nothing but the text of a property listing. Topic models (LDA) provide some explanation for why these methods work so well as they do. For example, our model for real estate explains some 70% of the variation in prices using just the text of the listing with no attempt to use location or related demographics.
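For readers who want a concrete feel for what "numerical regressors from text using spectral methods related to the SVD" could look like, here is a minimal sketch in Python with NumPy. It is not the procedure from the paper: the listings and prices are invented toy data, the names `featurize` and `fit_ols` are hypothetical, and the plain word-count matrix, the choice of `k = 2`, and the use of log price as the response are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy listings and prices, invented purely for illustration.
listings = [
    "spacious luxury condo with lake view",
    "luxury penthouse lake view granite kitchen",
    "cozy studio near train",
    "nice starter home needs work",
    "granite kitchen spacious family home",
    "cozy nice studio needs work",
]
prices = np.array([650_000, 900_000, 150_000, 120_000, 480_000, 110_000], dtype=float)


def featurize(texts, k):
    """Build a document-by-word count matrix and return its top-k left singular vectors."""
    vocab = sorted({word for text in texts for word in text.split()})
    column = {word: j for j, word in enumerate(vocab)}
    counts = np.zeros((len(texts), len(vocab)))
    for i, text in enumerate(texts):
        for word in text.split():
            counts[i, column[word]] += 1.0
    # Reduced SVD: the columns of u are orthonormal directions in document space.
    u, _, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k], vocab


def fit_ols(features, y):
    """Least-squares fit of y on the features plus an intercept; returns (beta, in-sample R^2)."""
    X = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return beta, 1.0 - ss_res / ss_tot


features, vocab = featurize(listings, k=2)
beta, r2 = fit_ols(features, np.log(prices))
print(f"{len(vocab)} distinct words reduced to {features.shape[1]} regressors")
print(f"in-sample R^2 on the toy data: {r2:.2f}")
```

With six made-up listings the in-sample R² means nothing in itself; the point is only the shape of the pipeline, in which text becomes a count matrix, the SVD compresses it to a few columns, and those columns go into an ordinary regression.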

## 3 Comments

1. ### Steve said,

November 19, 2013 @ 3:11 pm

"Topic models (LDA) provide some explanation for why these methods work so well as they do." I realize it may be too late to tweak the way Bob "headlines" the paper, but, FWIW, I would say that the models explain why the methods "work as well as they do" or that they explain why they "work so well." Or, perhaps, "… for why these methods work so well, as they do." But the last seems a bit clunky.

2. ### Jonathan said,

November 19, 2013 @ 8:15 pm

I attended this talk at CUNY and the most shocking thing, to an economist, is that Stine made principal component analysis seem theoretically sound. I was taught in graduate school that it almost never is. It's quite a good talk and we're investigating the methodology for an in-house application.

3. ### Don said,

December 6, 2013 @ 4:28 pm

Surely the work in the first Freakonomics book is related? (Correlation between words and final selling price as seller-agents subtly signal the potential buyers about the home.)
http://freakonomics.com/books/freakonomics/chapter-excerpts/chapter-2/