Dialectometry
Every individual's speech is variable. And when we look beyond the individual, we see variation across space, time, style, and social structure — among other dimensions. And these variations are generally gradient rather than abrupt, although standardization efforts by national or regional governments may try to eliminate the variation.
For millennia, scholars have noted and catalogued these patterns of variation, and for the past couple of hundred years this study has been called dialectology. But until 1970 or so, people interested in this topic faced an uncomfortable choice: either pretend (falsely) that the variation can be put into a few well-defined boxes, or else limit their research to compiling very large lists of who said what where, when, and why.
About 50 years ago, some European researchers began trying to get past this dichotomous barrier, under the banner of "dialectometry". For a recent survey, see Martijn Wieling and John Nerbonne, "Advances in dialectometry", 2015 (from which I'll quote a long explanation):
The great tradition of European dialect geography produced innumerable detailed maps depicting the geographic distribution of variation, especially in word choice, pronunciation, and morphology. Researchers naturally sought to identify the deeper geographic and social structures that might be assumed to underlie many details and that might be examined as potentially explanatory. But as Bloomfield’s (1933, p. 340ff) classic discussion of this work noted, the maps of individual features often simply did not coincide, leading him to conclude that “in this respect […] dialect geography proved to be disappointing.” The problem usually revolved around how one should distinguish dialect areas, but modern dialectology recognizes that geographic distributions may involve continua or even scattered settlements.
Jean Séguy (1971, 1973) is credited with taking the liberating step of examining not individual features, but rather large aggregates of features, effectively asking how often two sites differ with respect to a given set of features (such as lexicalizations, but also the pronunciation of selected sounds, or the realization of a given morpheme). It is historically noteworthy that Haag (1898) had suggested something very similar, namely counting the isoglosses that separated sites to assay the strength of a putative border separating them, as noted by Bloomfield (1933) in the chapter cited above. Séguy not only took this step but presciently applied it to one of the foundational questions in dialect geography, the relation between aggregate linguistic differences and geographic distance (Séguy 1971).
In a programmatic article, Nerbonne (2009, p. 179) summed up the motivation for dialectometry’s attention to aggregates rather than individual features, arguing that the common practice of abstracting away from many details of phonetic variation is an implicit sort of aggregation that all variationists have accepted, and further noting that individual features are inevitably noisy (interpreting Bloomfield’s point above in this way). He also observed that the sheer number of available features makes it likely that a researcher focused on individual features can find some feature or other that coincides with a putative social or geographical influence, exposing the researcher to the danger of “cherry picking” — working with features that are selected (perhaps innocently) to confirm his or her hypotheses. Nerbonne (2009, pp. 190–91) finally notes that moving the analysis from the (categorical) level of individual features to the (numerical) level of aggregates enables language variationists to study general relations such as the law-like relation between linguistic differences and geographic distance demonstrated by Séguy (1971).
[…]
Whereas Black (1976) introduced multidimensional scaling (MDS) to linguistics, Embleton (1993) applied the technique specifically to dialectometry (see Embleton et al. 2013 for more recent work on alternative MDS visualizations). MDS takes a site × site distance table as input and tries to assign the sites in the table to coordinates in a small-dimensional space, typically consisting of two or three dimensions. Nerbonne et al. (1999) mapped MDS coordinates to color values for the first time, providing visual correlates in response to the frequent critique found in dialect atlases and treatises that the division of the language area into different dialect areas did little justice to the gradual nature of dialect boundaries. Figure 1 shows an example of one of these MDS maps, visualizing Dutch phonetic variation, together with a legend providing examples of words and how they are pronounced in their “fuzzy” areas. Heeringa’s (2004) dissertation used this form of presentation as well. Heeringa identified “typical” word pronunciations by selecting words whose distances correlated highly with the (distances on the basis of the) dimensions proposed in MDS, effectively the intensity of the colors shown in Figure 1.
Figure 1 — legend: The three most important multidimensional scaling dimensions (together accounting for more than 85% of the variation in the location × location distance table) have been mapped to red, green, and blue, thereby providing a comprehensive visualization of Dutch phonetic variation. The five legends provide some typical pronunciations in the areas with the purest colors. Note that areas are genuine, even though borders are gradual.
For more, you can read the rest of that article — or some of the other references offered by Google Scholar.
You'll learn that similar approaches have also been used to characterize stylistic, social, and temporal patterns of variation. And you'll also learn that this tradition has not in general tried to add to (or even really use) the inventory of terms for ways of talking, such as dialect, topolect, sociolect, idiolect, ethnolect, variety, style, …
Rather, the point has been to find insights to replace Bloomfield's disappointment, by exploring new ways to analyze and visualize the complex patterns of variation. This effort has been most successful as a way of looking at patterns in space, as in the figure reproduced above.
Beyond applications to European languages, such techniques have been applied to Berber, Javanese, Iranian, Lalo, and others. But as far as I know, the (different varieties of) Han languages (or dialects or topolects or whatever) have not yet been analyzed in this way, although a large initial tranche of needed data has been provided by the Linguistic Atlas of Chinese Dialects.
And based on that source, He Huang, Jack Grieve, Lei Jiao, and Zhuo Cai have published "Geographic structure of Chinese dialects: a computational dialectometric approach". One of that publication's many maps is reproduced below — the authors comment that
This map clearly shows that the Chinese dialect landscape is highly complex, consisting neither of a single dialect continuum nor a collection of distinct dialect areas separated by sharp borders. Instead, it includes clear dialect areas of relative homogeneity and varying sizes separated by both relatively sharp borders and areas of more gradual transition.
The use of the term "dialect" in these publications starts with the (translations of the) Chinese sources. But the term is problematic, because it describes a collection of ways of talking that are at least as diverse (and mutually (un)intelligible) as the Romance "dialects" like French and Italian and Spanish, or the Germanic "dialects" like German and Dutch and English. For some of the varieties, it probably makes sense to use the term "language" — and for others, perhaps Victor Mair's term "topolect" makes more sense. But the point of the dialectometric approach is to examine the data without assuming any particular bin boundaries, and to let the number and nature of the divisions emerge from the quantitative analysis.
There are many interesting methodological issues in the Huang et al. paper, but (for now) I'll leave readers to explore them on their own.
Update — I should note that a different approach to making sense of multi-dimensional patterns of linguistic variation was pioneered by Bill Labov, also in the mid-20th century, under the name of "sociolinguistics". As that term implies, it focuses on social dimensions (age, gender, ethnicity, socio-economic status, formality, etc.) more strongly than on geographical dimensions. The quantitative analysis methods are drawn from multi-variate statistics, and also include purely linguistic factors as independent variables.
Garrett Wollman said,
April 26, 2024 @ 8:30 pm
This inevitably reminds me of Bill Kretzschmar's work on quantifying (distributions of) phonetic features and his diatribe(s) against taking the arithmetic mean of a non-normal distribution (statistically meaningless) and then treating that one number as characteristic of a population of speakers.
AG said,
April 26, 2024 @ 8:45 pm
Please forgive me a brief geographical sally – LOL at Chinese researchers being compelled to put the 9-dash line on every map but then having no data to include for that part, because… well, because it's so obviously not part of China, linguistically or otherwise.
Mark Liberman said,
April 26, 2024 @ 8:49 pm
@Garrett Wollman: This inevitably reminds me of Bill Kretzschmar's work on quantifying (distributions of) phonetic features […]
See John Nerbonne and William Kretzschmar. "Introducing computational techniques in dialectometry", Computers and the Humanities 37 (2003): 245-255.
Which reference would you recommend for the cited diatribe?
Chris Button said,
April 26, 2024 @ 10:25 pm
What does the "quantitative" analysis measure exactly? Does it measure the number of variations or the degree of each variation? And how does one define that?
Surely widespread borrowing also throws wrinkles in it all too? The most extreme case would be Japanese "on" readings as representing a dialect, or rather co-existing dialects, of Chinese within the Japanese language.
Garrett Wollman said,
April 27, 2024 @ 12:03 am
@myl My most immediate reference is to his talk at the 2024 ADS annual meeting. I emailed him afterwards and he sent me a couple of papers but I don't have them to hand. (I'm pretty sure he said something similar at the 2019 ADS meeting but that's long enough ago that I wouldn't stake anything on it. As you note he's been doing this line of research for a long time.)
Mark Liberman said,
April 27, 2024 @ 5:32 am
@Chris Button: "What does the "quantitative" analysis measure exactly?"
There are many variant approaches, depending on the goals and the available data — you can sample the literature to find the details.
For producing a geospatial map, the starting place is an array of linguistic data (which might be word choices, morphological choices, pronunciation choices, whatever) from M subjects located at N map coordinates. Then using one of many methods, an MxM or NxN array of inter-point distances is computed, and Multidimensional Scaling (or similar) is used to calculate a low-dimensional version of the distance matrix, which would then be turned into map colors.
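As a rough illustration of the pipeline just described, here is a minimal Python sketch. It is a toy under stated assumptions, not the method of any particular paper: the distance between two sites is simply the share of surveyed items on which they give different answers, the MDS is the classical (Torgerson) variant computed by power iteration so that no external library is needed, and all function names are hypothetical.

```python
import math
import random


def aggregate_distance(a, b):
    """Share of items on which two sites differ; None marks a missing answer."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        raise ValueError("the two sites have no items in common")
    return sum(x != y for x, y in shared) / len(shared)


def distance_matrix(sites):
    """Site x site table of aggregate distances."""
    n = len(sites)
    return [[aggregate_distance(sites[i], sites[j]) for j in range(n)]
            for i in range(n)]


def classical_mds(dist, dims=3, iters=500, seed=0):
    """Classical MDS: coordinates from the top eigenvectors of the
    double-centred squared-distance matrix, found by power iteration.
    Assumption: the largest-magnitude eigenvalues are positive; if a
    negative one dominates, the remaining dimensions are left at zero."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient estimates the eigenvalue belonging to v.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))
        if eigval <= 1e-12:
            break
        for i in range(n):
            coords[i][k] = v[i] * math.sqrt(eigval)
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


def to_colour_channels(coords):
    """Rescale each MDS dimension to 0..255, one colour channel per dimension."""
    channels = []
    for column in zip(*coords):
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0
        channels.append([round(255 * (x - lo) / span) for x in column])
    return [tuple(site) for site in zip(*channels)]
```

With three dimensions, each site's tuple can be used directly as an RGB value for its map marker. Real studies differ mainly in the first step, that is, in how the site × site distances are computed.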
Garrett Wollman said,
April 27, 2024 @ 10:52 am
To follow up on my previous comment, now that I am in the home office and can look, the most recent paper is "Point Pattern Analysis in Vowel Space" by Kretzschmar and Joseph Stanley (unpublished draft), which was the actual subject of Kretzschmar's ADS talk. I assume it will appear in an appropriate journal in the future.
Jonathan Smith said,
April 27, 2024 @ 11:20 am
As I kind of said in the other thread, it’s easy for readers and co-authors to get caught up in math and forget more fundamental questions.
One could also color, e.g., England + France based on a data set in which œil / eye, dent / tooth, pied / foot, etc., etc., were matches — i.e., shared "morphological choices". (This is, for those unfamiliar, exactly what is going on in the underlying data here: Chinese character headings in LACD [purport to] represent etyma, covering [suggested] daughter reflexes across all of Sinitic.)**
But what would be one’s motivation for doing so? What would one think was revealed by such a “dialectometric” approach, vs. what would actually be revealed (which would be, we must acknowledge, certainly something)? What sorts of lofty conclusions might one draw on the basis of such a map about the Indo-European language?
**As far as I can tell, the authors’ "all_china.csv" is not really “raw data extracted from the LACD” or translations thereof but abstract alphanumeric representations of etyma, regional forms, etc., as they are presented in that text — so anyone interested in linguistic reality should (1) get LACD, (2) learn Chinese, and go from there.
Chris Button said,
April 27, 2024 @ 12:22 pm
That would have essentially been my starting point, plus some Old Chinese evidence. But even then I still wonder how you quantify the importance of a major individual sound shift versus the totality of several less consequential shifts.
In my opinion, all linguistics (outside of perhaps pure articulatory phonetics) merits at least some attention to comparative historical linguistics.
Jarek Weckwerth said,
April 27, 2024 @ 2:24 pm
@ Chris Button: What does the "quantitative" analysis measure exactly?
One approach is e.g. Levenshtein distance of phonetic transcriptions at a defined level of accuracy. But I've seen at least one paper (I think it was on Norwegian) where actual spectra of actual recordings were the input. I can try locating it if you're interested.
[(myl) See here (spectral distances) and here ("neural" distances).]
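For readers unfamiliar with the Levenshtein approach mentioned above, a minimal Python sketch follows. It assumes each transcription is a sequence of single-symbol segments and divides by the longer length as a simple normalization; published dialectometric work often uses more refined segment costs and normalizations, which are not reproduced here. The function names and the example strings are illustrative, not taken from any atlas.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the length of the longer sequence (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Two made-up transcriptions differing by one inserted vowel.
print(levenshtein("mɛlk", "mɛlək"))
print(normalized_levenshtein("mɛlk", "mɛlək"))
```

Passing lists of segments instead of plain strings avoids miscounting symbols that carry combining diacritics. Averaging such distances over many words gives one cell of the site × site table.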
widespread borrowing also throws wrinkles
I would think that the main problem is social variability in each geographical area. (Cf. the Labov thread above.)
Jonathan Smith said,
April 27, 2024 @ 3:12 pm
Re: "What does quantitative analysis measure exactly?"
If the question concerns this paper in particular, see Section 2.3 beginning on p. 13, in which we learn that
"Because LACD’s taxonomic system is categorical, it is not accessible for calculating edit distance based on phonetic forms. […] Instead, we have developed a method that we call 'Weighted Jaccard Distance' (WJD) to measure distances based on the LACD dataset. Jaccard distance is a measure of dissimilarity between two sample sets. It ranges from 0 to 1: the closer to 1, the more dissimilar the two data […]"
To take a specific example, in the case of the three words for 'I, me' which I noted in the other thread (which relate to the authors' map on p. 11 in turn from LACD) — Standard Mandarin /wo3/, Standard Cantonese /ŋɔ5/, and Southern Min Amoy /gua2/ — the authors' CODt, CODs, and CODm (co-differences for so-called "type", "subtype" and "mini-type" respectively as defined somewhat problematically by reference to columns and color-coding in the LACD) are 0 for all three pairs, thus DVf (difference between the two variants being under consideration) also = 0 for all three pairs.
That is to say, these three items (to which compare English I, German Ich, French Je) are all the same — because they are all assigned to the etymon represented "我" in LACD :D It's math, folks.
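The effect can be seen with the plain (unweighted) Jaccard distance in a few lines of Python. This is only a sketch of the underlying measure: the authors' weighting scheme (WJD) is not reproduced, and the dictionaries below are a hypothetical encoding of the three forms cited above.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Coded by the etymon label, all three varieties look identical.
by_etymon = {"Mandarin": {"我"}, "Cantonese": {"我"}, "Amoy": {"我"}}
# Coded by the phonetic forms cited above, they share nothing.
by_form = {"Mandarin": {"wo3"}, "Cantonese": {"ŋɔ5"}, "Amoy": {"gua2"}}

print(jaccard_distance(by_etymon["Mandarin"], by_etymon["Cantonese"]))
print(jaccard_distance(by_form["Mandarin"], by_form["Cantonese"]))
```

The measure itself is neutral; what it reports depends entirely on whether the sets hold etymon labels or surface forms.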
To be clear, the authors' results do mean something… but not exactly what they say. For more shall we say nuanced discussions of the Sinitic languages of China, see, e.g., De Gruyter's series The Sinitic Languages of China (*cough*)
David Marjanović said,
April 27, 2024 @ 3:14 pm
The existence of sociolects is more of an English phenomenon; it's far from global.
Jarek Weckwerth said,
April 27, 2024 @ 3:56 pm
@ Jonathan Smith: This warrants a really good look at that paper. It's far enough from the "traditional" Nerbonne etc. stuff that I find it hard to interpret as a simple phonetician.
@ David Marjanović: Can you elaborate? Which non-English-speaking areas would you have in mind?
Jichang Lulu said,
April 27, 2024 @ 7:16 pm
Doesn't it just defeat the purpose of a dialectometric study to smuggle etymological analysis into the metric? Because that's what's in the structure of the data here. Not to mention that, with some exceptions, that etymology (implied in the assignment of words or morphemes to graphemes of written Chinese) doesn't even rely on any actual diachronic work, but some combination of conventional accretion and guesswork.
The problem is of course inherited from the dataset, but it looks like a fatal flaw. And language like that in this passage seems to celebrate that flaw rather than seek to address it:
> The classification of word forms used by LACD is usually based on the word’s root (morpheme) rather than its phonetic form. Marking morphemes using Chinese characters is more convenient and economical than using complex phonetic word forms because Chinese characters are logographic.
This is fundamentally different from the goals illustrated in the map of the Netherlands above, where dialectometry lets you compare tokens independently of, say, subjective mutual intelligibility or diachronic analysis. For example, the Netherlands figure illustrates a similar treatment for Dutch forms meaning ‘milk’ (which are cognate) and ‘chicken’ (which are not). Compare this to the treatment of the words for ‘I’ Smith mentions.
The output of that kind of analysis can be compared to, and question, received language classification models based on simpler comparison, not to mention political dogma. Cf. discussion of ‘Italo-Romance’.
If (especially low quality) etymology is being fed (in an uncontrolled way) to the distance function, what's the point of critiquing previous work as ‘emphasiz[ing] a diachronic rather than a synchronic perspective’? And what's the significance of the study then largely recovering the categories of that earlier work?
Jichang Lulu said,
April 27, 2024 @ 7:31 pm
> being compelled to put the 9-dash line on every map
Ten dashes (since last year), and counting.
China_10-dash_line.bdf etc available in the supplementary materials, for those who need to update their non-compliant maps at home.
Jonathan Smith said,
April 27, 2024 @ 8:26 pm
Re: Jichang Lulu's "…etymology (implied in the assignment of words or morphemes to graphemes of written Chinese) doesn't even rely on any actual diachronic work, but some combination of conventional accretion and guesswork."
Yup — unfortunately, this problem is shared with more careful work like the Sinitic Languages series I mentioned above, where character representations of the same kind are explicitly labeled 'etyma'.
And re: the authors' "Marking morphemes using Chinese characters is more convenient and economical than using complex phonetic word forms because Chinese characters are logographic."
I called this the heart of the matter on the earlier thread — the statement is a mess for a number of reasons, one being that Chinese characters are indeed (to a certain degree) logographic *as they are applied within Standard Mandarin orthography, or within Cantonese orthography, etc.*, but "我" used to represent a pan-Sinitic etymon ain't logographic anymore cuz Mandarin wo3, Cantonese ŋɔ5, etc., iz different wurds.
And yeah as Jichang Lulu also notes, the authors presenting their approach as "synchronic" is a howler…
Victor Mair said,
April 27, 2024 @ 9:34 pm
In all of his comments on this post where he refers to his remarks on "the earlier thread", Jonathan Smith is talking about this:
https://languagelog.ldc.upenn.edu/nll/?p=63632#comment-1616316
Jarek Weckwerth said,
April 29, 2024 @ 3:31 pm
@ MYL: [(myl) See here (spectral distances) and here ("neural" distances).]
I almost missed this! Thank you, these look fascinating. I had something much much older in mind: Gooskens & Heeringa (2003) "Norwegian Dialects Examined Perceptually and Acoustically" : https://www.jstor.org/stable/30204903
KIRINPUTRA said,
May 2, 2024 @ 11:21 pm
Wow @Jichang Lulu & @Jonathan Smith for precisely & concisely challenging a couple of literally-everyday biases in Sinology (?) that are rarely challenged.
David Marjanović said,
May 5, 2024 @ 11:40 am
Pretty much all Spanish- or German-speaking places, nowadays also French- and Russian-speaking ones as far as I've noticed. People opening their mouth and thereby promptly identifying themselves as "definitely lower middle class" or whatever is really not common.
Jarek Weckwerth said,
May 6, 2024 @ 3:45 pm
@ David Marjanović: Pretty much all Spanish- or German-speaking places, nowadays also French- and Russian-speaking ones
I have to say I'm somewhat surprised. Just back from a few days in Berlin, and earlier from a trip to Andalucia, I would say I need you to elaborate a little more. I'm also somewhat familiar with the situation in Vienna; and from the French hip-hop my wife used to listen to, and a trip to Montreal, I would say there's considerable social variation there too, even though French sociolinguistics isn't a forte of mine. Even in my own country (Poland), which is far more homogeneous linguistically than most, there is very clear variation between the acrolect and basilect; I would even argue it's more prominent than geographical variation.
But in general I would say there's very clear triangular variation a la Trudgill in all those languages. And it is the reverse situation, where the topolect (to borrow Victor Mair's terminology) extends all the way to the top of the social scale, like in Norway, that is unusual for a larger language. (And even there, it seems to me now that I've been embedded in a project where Norwegian is a central element, it is exaggerated for ideological reasons.)