# Non-markovian yawp

*September 18, 2011*

Now that I've got morning internet access again, and the semester is more or less underway, it's time for another Breakfast Experiment™.

In "[Markov's Heart of Darkness](http://languagelog.ldc.upenn.edu/nll/?p=3277)" (7/18/2011) and "[Finch linguistics](http://languagelog.ldc.upenn.edu/nll/?p=3261)" (7/13/2011), we learned that Joseph Conrad's paragraphs are more markovian — at least in terms of their distribution of lengths — than zebra finch song bouts are. So I wondered about length distributions in some other sources — pause groups in conversational speech, and lines in Walt Whitman's poetry.

By "pause group" I mean simply the stretch of speech between silent pauses, as described in "[The shape of a spoken phrase](http://itre.cis.upenn.edu/~myl/languagelog/archives/003011.html)" (4/12/2006). As a source of data, I used the [Mississippi State word alignments](http://www.isip.piconepress.com/projects/switchboard/) for Switchboard.
(In particular, the version that I used is [here](http://ldc.upenn.edu/myl/swtichboard_word_alignments.tar.gz), and the word counts for the 509,242 pause groups in the corpus, as I extracted them from the \*word.text files, are [here](http://languagelog.ldc.upenn.edu/myl/SWB_PauseGroupLengths).)

The minimum length is one word, and the maximum is 63:

[![Click to embiggen](http://languagelog.ldc.upenn.edu/myl/SWBPauseGroups1.png)](http://languagelog.ldc.upenn.edu/myl/SWBPauseGroups1.png)

The mode is 1 — as it would have to be for a two-state markov process to be responsible for the data — and the mean is 6.03. However, the empirical probability of continuing after N words is by no means constant:

[![Click to embiggen](http://languagelog.ldc.upenn.edu/myl/SWBPauseGroups3.png)](http://languagelog.ldc.upenn.edu/myl/SWBPauseGroups3.png)

(The histogram counts are [here](http://languagelog.ldc.upenn.edu/myl/SWB_PauseGroupHist), and the calculated empirical probabilities of continuation are [here](http://languagelog.ldc.upenn.edu/myl/SWB_ContinuationProbabilities).)

I conjecture that the special behavior of very short pause groups reflects the fact that these conversational pause groups are a mixture of at least two quite different processes, one process generating quasi-independent contributions like

> yeah i mean
> for somebody who is
> you know for most of their life has has
> uh
> not just merely had a farm but had ten children
> had a farm
> ran everything because her husband was away in the coal mines and
> and you know
> facing that situation it it's quite a dilemma
> i think

and the other generating short backchannel feedback like "right", "I know", "yeah really", "oh I see", …

The larger-scale process seems to gradually run out of steam, in the sense that beyond 3 words, the probability of continuing falls gradually and steadily, up to and beyond 25 words. This decline is systematic and statistically significant, and it means that the length distribution of even the longer pause groups can't reflect a simple markov process (for the reasons explained [here](http://languagelog.ldc.upenn.edu/nll/?p=3261)). However, the fall is quite gradual, and the distribution is quite smooth, so that (especially if we treat length one as special) it would be pretty well approximated by an exponential decay:

[![Click to embiggen](http://languagelog.ldc.upenn.edu/myl/SWBPauseGroups2.png)](http://languagelog.ldc.upenn.edu/myl/SWBPauseGroups2.png)

What about sequential effects? At least taking each conversational side by itself, there is a statistically significant but small (r = 0.1) positive correlation between the lengths of adjacent pause groups. Linear regression yields

> G_{n+1} = 5.4 + 0.1 × G_n

That is, the length in words of pause group n+1 is predicted to be 5.4 plus one tenth of the length of pause group n.
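The "empirical probability of continuing after N words" is simple to compute from the raw length counts: among all groups that reached length N, what fraction went on to at least N+1 words? A minimal Python sketch (the original analysis was presumably done in R; the lengths below are toy values, not the Switchboard counts):

```python
from collections import Counter

def continuation_probabilities(lengths):
    """For each N, the fraction of groups of length >= N
    that continued to length >= N+1."""
    counts = Counter(lengths)
    max_len = max(counts)
    # Number of groups of length >= N, via a reverse cumulative sum.
    at_least = {}
    total = 0
    for n in range(max_len, 0, -1):
        total += counts.get(n, 0)
        at_least[n] = total
    return {n: at_least.get(n + 1, 0) / at_least[n]
            for n in range(1, max_len + 1)}

# Toy data: under a two-state Markov process these values would all be
# (roughly) equal; in the Switchboard data they are clearly not.
lengths = [1, 1, 1, 2, 2, 3, 4, 6, 8, 12]
probs = continuation_probabilities(lengths)
```

A flat plot of these values over N is the markovian signature; the gradual decline seen in the figures above is what rules it out.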
There's enough data that the slope of this relationship is clearly different from zero — p < 2 × 10^(-16), according to R — but only around 1% of the variance in length is being accounted for.

In order to look at line lengths in Whitman's poetry, I downloaded the 1881-1882 edition of Leaves of Grass from the [Whitman Archive](http://whitmanarchive.org/published/LG/index.html), and ran the .html file through a little script to eliminate (I hope) everything but poem text and titles, with the titles used to delimit poems but otherwise ignored. Run-on lines (marked with line-initial spaces in this text) were joined. The histogram of line lengths in the result is as follows:

[![Click to embiggen](http://languagelog.ldc.upenn.edu/myl/WhitmanLinesX1.png)](http://languagelog.ldc.upenn.edu/myl/WhitmanLinesX1.png)

Here we can see some evidence that a different process is involved in creating the counts for very short lines; and in this case, we can tell more exactly what it is. I didn't take the time to distinguish section numbers and internal section names from lines of poetry, and most of these — there are about 200 of them in the 10,202 lines — are of length 1. And in fact, most of the "lines" of length 1 are such section numbers and internal section names.

The modal line length is 10 words — with or without omitting the lines of length 1 — and the mean is 11.1. Omitting "lines" of length 1, the mean is 11.3.

This seems to be a distinctly different pattern from Conrad's paragraphs and Switchboard pause groups.
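The line-joining and word-counting step can be sketched as follows. This is a hypothetical reconstruction, not the actual script: it assumes the poem text has already been extracted from the Whitman Archive HTML, and that (as described above) run-on continuations are marked by line-initial spaces:

```python
def line_lengths(text):
    """Word counts per poetic line, joining run-on lines
    (marked by line-initial whitespace) to the previous line."""
    lines = []
    for raw in text.splitlines():
        if not raw.strip():
            continue                         # skip blank lines
        if raw[0].isspace() and lines:
            lines[-1] += " " + raw.strip()   # run-on continuation
        else:
            lines.append(raw.strip())
    return [len(line.split()) for line in lines]

# Toy example: the indented third line is treated as a run-on
# continuation of the second (purely to illustrate the mechanics).
sample = ("I celebrate myself, and sing myself,\n"
          "And what I assume you shall assume,\n"
          "  For every atom belonging to me as good belongs to you.")
```

Filtering out the length-1 "lines" (mostly section numbers and internal section names) would then be a one-line list comprehension over the result.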
But a plot of the empirical probability of continuation has some qualitative similarities to the Switchboard plot:

[![Click to embiggen](http://languagelog.ldc.upenn.edu/myl/WhitmanLinesX2.png)](http://languagelog.ldc.upenn.edu/myl/WhitmanLinesX2.png)

In particular, there's an initial rise (here from 1 to 2) reflecting the fact that most very short "lines" are the output of a quite different process; and then a steady fall (here from 2 to 15) reflecting the same non-markovian "running out of steam" phenomenon that we saw in the zebra finch song bouts ("[Finch linguistics](http://languagelog.ldc.upenn.edu/nll/?p=3261)", 7/13/2011) as well as in the Switchboard pause groups.

The patterns are quantitatively quite different — but perhaps a wider range of free-verse authors, and a wider range of speech styles, would yield more quantitative overlap. In particular, it seems possible that skilled extemporaneous narrative might have a modal pause-group length more like Whitman's modal line length.

What about sequential effects in the Whitman line-length data? There's also a positive correlation between the lengths of adjacent lines, and it's a bit larger than in the Switchboard pause groups: r = 0.3, with 9% of the variance accounted for.
The coefficients of a linear model are

> L_{n+1} = 7.9 + 0.3 × L_n

A two-dimensional histogram exhibits the relationship graphically:

[![Click to embiggen](http://languagelog.ldc.upenn.edu/myl/WhitmanLines4.png)](http://languagelog.ldc.upenn.edu/myl/WhitmanLines4.png)

The aspect of all of this that intrigues me the most is the prevalence of processes in which the probability of continuing decreases gradually as the generated string lengthens. We see the same thing in zebra finch song bouts, conversational pause groups, and lines of free verse. It's a simple idea, and easy to implement algorithmically, but I haven't seen a mathematical or neurological treatment (though this is at least as likely to reflect my ignorance as the state of the literature).

Anyhow, Whitman also saw a connection between his poetry and avian vocalizations:

> The spotted hawk swoops by and accuses me, he complains of my gab and my loitering.
>
> I too am not a bit tamed, I too am untranslatable,
> I sound my barbaric yawp over the roofs of the world.
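[A postscript on "easy to implement algorithmically": a minimal sketch of the contrast, with made-up parameters that are not fitted to any of the data above. A constant continuation probability is the two-state Markov case and yields geometric lengths; a probability that decays with position is the "running out of steam" case.]

```python
import random

def sample_length(p_continue, max_len=100, rng=random):
    """Generate one string length: keep adding words while a
    position-dependent coin keeps coming up 'continue'."""
    n = 1
    while n < max_len and rng.random() < p_continue(n):
        n += 1
    return n

# Markov case: constant continuation probability -> geometric lengths.
markov = lambda n: 0.8
# "Running out of steam": continuation probability decays with length.
fatigue = lambda n: 0.95 / (1 + 0.05 * n)

random.seed(1)
markov_lengths = [sample_length(markov) for _ in range(10_000)]
fatigue_lengths = [sample_length(fatigue) for _ in range(10_000)]
```

Plotting empirical continuation probabilities for the two samples would show a flat line for the first and a gradual decline for the second — the shape seen in the Switchboard and Whitman figures.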