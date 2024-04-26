

Every individual's speech is variable — and when we look beyond the individual, we see variation across space, time, style, and social structure — among other dimensions. And these variations are generally gradient rather than abrupt, although standardization efforts by national or regional governments may try to eliminate the variation.

For millennia, scholars have noted and catalogued these patterns of variation, and for the past couple of hundred years, this study has been called dialectology. But until 1970 or so, people interested in this topic faced an uncomfortable choice: either pretend (falsely) that the variation can be put into a few well-defined boxes, or else limit their research to compiling very large lists of who said what where, when, and why.

About 50 years ago, some European researchers began trying to get past this dichotomous barrier, under the banner of "dialectometry". For a recent survey, see Martijn Wieling and John Nerbonne, "Advances in dialectometry", 2015 (from which I'll quote a long explanation):

The great tradition of European dialect geography produced innumerable detailed maps depicting the geographic distribution of variation, especially in word choice, pronunciation, and morphology. Researchers naturally sought to identify the deeper geographic and social structures that might be assumed to underlie many details and that might be examined as potentially explanatory. But as Bloomfield’s (1933, p. 340ff) classic discussion of this work noted, the maps of individual features often simply did not coincide, leading him to conclude that “in this respect […] dialect geography proved to be disappointing.” The problem usually revolved around how one should distinguish dialect areas, but modern dialectology recognizes that geographic distributions may involve continua or even scattered settlements.

Jean Séguy (1971, 1973) is credited with taking the liberating step of examining not individual features, but rather large aggregates of features, effectively asking how often two sites differ with respect to a given set of features (such as lexicalizations, but also the pronunciation of selected sounds, or the realization of a given morpheme). It is historically noteworthy that Haag (1898) had suggested something very similar, namely counting the isoglosses that separated sites to assay the strength of a putative border separating them, as noted by Bloomfield (1933) in the chapter cited above. Séguy not only took this step but presciently applied it to one of the foundational questions in dialect geography, the relation between aggregate linguistic differences and geographic distance (Séguy 1971).

In a programmatic article, Nerbonne (2009, p. 179) summed up the motivation for dialectometry’s attention to aggregates rather than individual features, arguing that the common practice of abstracting away from many details of phonetic variation is an implicit sort of aggregation that all variationists have accepted, and further noting that individual features are inevitably noisy (interpreting Bloomfield’s point above in this way). He also observed that the sheer number of available features makes it likely that a researcher focused on individual features can find some feature or other that coincides with a putative social or geographical influence, exposing the researcher to the danger of “cherry picking” — working with features that are selected (perhaps innocently) to confirm his or her hypotheses. Nerbonne (2009, pp. 190–91) finally notes that moving the analysis from the (categorical) level of individual features to the (numerical) level of aggregates enables language variationists to study general relations such as the law-like relation between linguistic differences and geographic distance demonstrated by Séguy (1971).

[…]

Whereas Black (1976) introduced multidimensional scaling (MDS) to linguistics, Embleton (1993) applied the technique specifically to dialectometry (see Embleton et al. 2013 for more recent work on alternative MDS visualizations). MDS takes a site × site distance table as input and tries to assign the sites in the table to coordinates in a small-dimensional space, typically consisting of two or three dimensions. Nerbonne et al. (1999) mapped MDS coordinates to color values for the first time, providing visual correlates in response to the frequent critique found in dialect atlases and treatises that the division of the language area into different dialect areas did little justice to the gradual nature of dialect boundaries. Figure 1 shows an example of one of these MDS maps, visualizing Dutch phonetic variation, together with a legend providing examples of words and how they are pronounced in their “fuzzy” areas. Heeringa’s (2004) dissertation used this form of presentation as well. Heeringa identified “typical” word pronunciations by selecting words whose distances correlated highly with the (distances on the basis of the) dimensions proposed in MDS, effectively the intensity of the colors shown in Figure 1.

Figure 1 — legend: The three most important multidimensional scaling dimensions (together accounting for more than 85% of the variation in the location × location distance table) have been mapped to red, green, and blue, thereby providing a comprehensive visualization of Dutch phonetic variation. The five legends provide some typical pronunciations in the areas with the purest colors. Note that areas are genuine, even though borders are gradual.
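
The aggregation step credited to Séguy in the quoted passage amounts to asking on what share of a feature set two sites differ. Here's a minimal Python sketch of that idea, not anyone's published implementation. The site names, feature names, and variants are invented, and the helper names (`aggregate_distance`, `distance_table`) are mine. Real dialectometric work often uses graded measures, such as string edit distance over pronunciations, rather than this simple same-or-different count.

```python
from itertools import combinations

# Invented toy survey: the variant recorded at each site for each feature.
# Site names, feature names, and variants are all hypothetical.
survey = {
    "site_a": {"f1": "x", "f2": "p", "f3": "m", "f4": "s"},
    "site_b": {"f1": "x", "f2": "p", "f3": "n", "f4": "s"},
    "site_c": {"f1": "y", "f2": "q", "f3": "n", "f4": "s"},
    "site_d": {"f1": "y", "f2": "q", "f3": "n", "f4": "t"},
}


def aggregate_distance(a, b):
    """Share of jointly recorded features on which two sites differ."""
    shared = [f for f in a if f in b]
    if not shared:
        raise ValueError("sites share no features")
    differing = sum(1 for f in shared if a[f] != b[f])
    return differing / len(shared)


def distance_table(data):
    """Build a symmetric site-by-site table of aggregate distances."""
    names = sorted(data)
    table = {(s, s): 0.0 for s in names}
    for s, t in combinations(names, 2):
        d = aggregate_distance(data[s], data[t])
        table[(s, t)] = table[(t, s)] = d
    return table


table = distance_table(survey)
for (s, t), d in sorted(table.items()):
    if s < t:
        print(f"{s} vs {t}: {d:.2f}")
```

In this toy data no single feature splits the sites the same way as any other, but the aggregate table still shows a clean gradient: neighboring sites differ on one feature in four, while `site_a` and `site_d` differ on all four.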

For more, you can read the rest of that article — or some of the other references offered by Google Scholar.

You'll learn that similar approaches have also been used to characterize stylistic, social, and temporal patterns of variation. And you'll also learn that this tradition has not in general tried to add to (or even really use) the inventory of terms for ways of talking, such as dialect, topolect, sociolect, idiolect, ethnolect, variety, style, …

Rather, the point has been to replace Bloomfield's disappointment with insight — by exploring ways to analyze and visualize the complex patterns of variation. This effort has been most successful as a way of looking at patterns in space, as in the figure reproduced above.
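
The recipe behind a map of that kind is short enough to sketch: take a site × site distance table, run MDS to get three coordinates per site, and treat those coordinates as red, green, and blue. The Python below is a rough illustration under several assumptions of mine. It uses classical (Torgerson) MDS, which is only one MDS variant and may not be the one used for the published maps. The per-channel min-max rescaling in `to_rgb` is my guess at a reasonable color mapping. The distance values are invented.

```python
import numpy as np


def classical_mds(dist, n_dims=3):
    """Classical (Torgerson) MDS: embed a symmetric distance table in n_dims."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to get an inner-product matrix.
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Non-Euclidean tables can give negative eigenvalues; clip them to zero.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


def to_rgb(coords):
    """Rescale each MDS dimension to [0, 1] so it can serve as a color channel."""
    lo = coords.min(axis=0)
    span = coords.max(axis=0) - lo
    span[span < 1e-12] = 1.0  # leave a (near-)constant dimension at zero
    return (coords - lo) / span


# Hypothetical aggregate distances among five sites (symmetric, zero diagonal).
labels = ["s1", "s2", "s3", "s4", "s5"]
dist = [
    [0.0, 0.2, 0.5, 0.7, 0.6],
    [0.2, 0.0, 0.4, 0.6, 0.5],
    [0.5, 0.4, 0.0, 0.3, 0.4],
    [0.7, 0.6, 0.3, 0.0, 0.3],
    [0.6, 0.5, 0.4, 0.3, 0.0],
]

coords = classical_mds(dist, n_dims=3)
rgb = to_rgb(coords)
for name, color in zip(labels, rgb):
    print(name, np.round(color, 2))
```

Sites that are linguistically close get similar colors, and a gradual transition between areas shows up as a gradual change in hue rather than as a line on the map.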

Beyond applications to European languages, such techniques have been applied to Berber, Javanese, Iranian, Lalo, and others. But as far as I know, the (different varieties of) Han languages (or dialects or topolects or whatever) have not yet been analyzed in this way, although a large initial tranche of needed data has been provided by the Linguistic Atlas of Chinese Dialects.

And based on that source, He Huang, Jack Grieve, Lei Jiao, and Zhuo Cai have published "Geographic structure of Chinese dialects: a computational dialectometric approach". One of that publication's many maps is reproduced below — the authors comment that

This map clearly shows that the Chinese dialect landscape is highly complex, consisting neither of a single dialect continuum nor a collection of distinct dialect areas separated by sharp borders. Instead, it includes clear dialect areas of relative homogeneity and varying sizes separated by both relatively sharp borders and areas of more gradual transition.

The use of the term "dialect" in these publications starts with the (translations of the) Chinese sources. But the term is problematic, because it describes a collection of ways of talking that are at least as diverse (and mutually (un)intelligible) as the Romance "dialects" like French and Italian and Spanish, or the Germanic "dialects" like German and Dutch and English. For some of the varieties, it probably makes sense to use the term "language", and for others, perhaps Victor Mair's term "topolect" makes more sense. But the point of the dialectometric approach is to examine the data without assuming any particular bin boundaries, and to let the number and nature of the divisions emerge from the quantitative analysis.

There are many interesting methodological issues in the Huang et al. paper, but (for now) I'll leave readers to explore them on their own.

