I've collected all the (non-foreclosure) listings for 8 cities from trulia.com — about 50,000 listings altogether — and extracted the descriptions, e.g.
Truly,"A Diamond in the Desert".From the lovely dbl. door glass entry out to the gorgeous yard with oversized pool, you will not be disappointed. All rooms are open and airy. The large kit. has a cntr.island w/cooktop. The raised bar area has a sink and area for sitting. There is a lg. room upstairs w/wet bar-perfect for a media or hobby room.The master ste. is large with a beaut.sitting area for relaxing + fab.WI closet.
This pre-war one bedroom home is located on a high floor on a beautiful tree lined block. Open northern exposures from each room with beautiful city views, and a spectacular view of The Cathedral of Saint John the Divine. The apartment has recently been completely renovated without removing the original pre-war charm. Features include beamed ceilings, original oak wood flooring, stainless steel appliances, and baseboard moldings. Part-time doorman 3 PM-7 AM, live in super, bike room, storage, and laundry. Pied a terre's, gifting, co-purchasing are permitted with board approval.
I then tokenized the descriptions, created a histogram of (monocased) tokens (roughly, "words") for each city, and normalized the counts to the scale of "occurrences per million tokens". Pick a random word, and there's a good chance that there are large differences in its frequency between listings in different cities — and when the counts are large, as they are in all the examples I'll show you today, such differences are "statistically significant" to a massive degree. (There are also often differences in counts by price, and also in counts by price by city, but that's another set of stories.)
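For anyone who wants to try the same kind of count, here is a minimal sketch of the tokenize, count, and normalize steps. It is not the script behind these plots: the regular expression, the function names, and the two toy descriptions are all assumptions made for illustration, and a real tokenizer would need to decide what to do with abbreviations like "kit." and "cntr.island".

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase ("monocase") the text and keep runs of
# letters, allowing internal apostrophes. Digits and punctuation are dropped.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def rates_per_million(descriptions):
    """Count tokens over all descriptions and rescale to occurrences per million tokens."""
    counts = Counter()
    for description in descriptions:
        counts.update(tokenize(description))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count * 1_000_000 / total for word, count in counts.items()}

# Toy input standing in for one city's listings (not real data).
toy_listings = ["Hardwood floors and tile.", "Tile kitchen"]
toy_rates = rates_per_million(toy_listings)
```

Running `rates_per_million` once per city gives one table of rates per city, and any pair of words then gives one (x, y) point per city.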
Whether the differences are "significant" in any sense other than "not likely to reflect sampling error" (and what those other senses might be) is an open question.
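The "not likely to reflect sampling error" sense can be made concrete with a standard two-sample test. The sketch below uses the log-likelihood ratio (G-squared) statistic on a 2x2 table of word versus non-word tokens in two cities; this is one common choice for word counts, not necessarily the test behind the claim above, and the counts in the example are invented.

```python
import math

def g_squared(count_a, total_a, count_b, total_b):
    """Log-likelihood ratio test for one word's frequency in two corpora.

    count_a, count_b: occurrences of the word in each corpus.
    total_a, total_b: total tokens in each corpus.
    Returns (G-squared statistic, p-value for 1 degree of freedom).
    """
    observed = [count_a, total_a - count_a, count_b, total_b - count_b]
    grand_total = total_a + total_b
    word_total = count_a + count_b
    other_total = grand_total - word_total
    # Expected cell counts if the word had the same rate in both corpora.
    expected = [
        total_a * word_total / grand_total,
        total_a * other_total / grand_total,
        total_b * word_total / grand_total,
        total_b * other_total / grand_total,
    ]
    g = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    g = max(g, 0.0)  # guard against tiny negative values from rounding
    # Survival function of chi-square with 1 degree of freedom.
    p = math.erfc(math.sqrt(g / 2))
    return g, p

# Invented counts: 300 vs. 100 occurrences in a million tokens each.
g_diff, p_diff = g_squared(300, 1_000_000, 100, 1_000_000)
g_same, p_same = g_squared(100, 1_000_000, 100, 1_000_000)
```

With counts of this size, even a modest difference in rate produces a very small p-value, which is the sense in which the differences are significant "to a massive degree".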
Just for graphical fun, I've plotted the word frequencies in pairs. Here's the hardwood/tile space:
Makes sense, right?
Also makes sense, I guess — at least I could make up a story about it. [Update: and the NYT has a story today about two outdoor parking spaces in Boston selling for $560,000.]
But how about some/all:
I haven't done any cherry-picking here, in the sense of trying lots of things and just showing you the good stuff. These are just the first few pairs of words whose geographical distributions I thought might be interesting. In fact, it's not easy to find words whose rates of usage are NOT correlated with geography, to a similar degree, in these different sets of listings. Presumably these differences have reasons: in the distributions of things typically described, in the phrases typically used to describe them, in purely stylistic local habits, or (occasionally) in random sampling variation.
But as I noted in my earlier post, once such sets of datasets are available, it's not hard to find patterns in and among them. And the counts are large enough that relatively few of the apparent patterns can be attributed purely to sampling error — though it's hard to know how to correct for multiple comparisons in such cases.
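For reference, here is a sketch of two textbook corrections that could be applied if each word's test yielded a p-value: Bonferroni, which controls the chance of any false positive, and Benjamini-Hochberg, which controls the expected share of false positives among the words flagged. Whether either is appropriate here is exactly the open question, since counts of different words are far from independent; the p-values in the example are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag a test only if its p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false discovery rate control."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Remember the largest rank whose p-value is under its own threshold.
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    flagged = [False] * m
    for i in order[:cutoff]:
        flagged[i] = True
    return flagged

# Made-up p-values, one per hypothetical word.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bonf_flags = bonferroni(example_p)
bh_flags = benjamini_hochberg(example_p)
```

On these eight made-up values, Bonferroni flags only the smallest p-value while Benjamini-Hochberg flags the two smallest, which illustrates how the choice of correction changes the answer.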