{"id":23210,"date":"2015-12-31T05:11:28","date_gmt":"2015-12-31T10:11:28","guid":{"rendered":"http:\/\/languagelog.ldc.upenn.edu\/nll\/?p=23210"},"modified":"2016-01-02T09:47:22","modified_gmt":"2016-01-02T14:47:22","slug":"normalizing","status":"publish","type":"post","link":"https:\/\/languagelog.ldc.upenn.edu\/nll\/?p=23210","title":{"rendered":"Normalizing"},"content":{"rendered":"<p>Alberto Acerbi, Vasileios Lampos, Philip Garnett, &amp; R. Alexander Bentley, \"<a href=\"http:\/\/journals.plos.org\/plosone\/article?id=10.1371\/journal.pone.0059030\" target=\"_blank\">The Expression of Emotions in 20th Century Books<\/a>\", PLOSOne 3\/20\/2013:<\/p>\n<p style=\"padding-left: 30px;\"><span style=\"color: #000080;\">We report here trends in the usage of \u201cmood\u201d words, that is, words carrying emotional content, in 20th century English language books, using the data set provided by Google that includes word frequencies in roughly 4% of all books published up to the year 2008. We find evidence for distinct historical periods of positive and negative moods, underlain by a general decrease in the use of emotion-related words through time. 
Finally, we show that, in books, American English has become decidedly more \u201cemotional\u201d than British English in the last half-century, as a part of a more general increase of the stylistic divergence between the two variants of English language.<\/span><\/p>\n<p><!--more--><\/p>\n<p>One odd thing about this interesting paper, as Jamie Pennebaker has pointed out to me, is described in the Methods section:<\/p>\n<p style=\"padding-left: 30px;\"><span style=\"color: #000080;\">We obtained the time series of stemmed word frequencies via Google's Ngram tool (<a style=\"color: #000080;\" href=\"http:\/\/books.google.com\/ngrams\/datasets\">http:\/\/books.google.com\/ngrams\/datasets<\/a>) in four distinct data sets: 1-grams English (combining both British and American English), 1-grams English Fiction (containing only fiction books), 1-grams American English, and 1-grams British English. [&#8230;]<\/span><\/p>\n<p style=\"padding-left: 30px;\"><span style=\"color: #000080;\">For each stemmed word we collected the amount of occurrences (case insensitive) in each year from 1900 to 2000 (both included). [&#8230;]<\/span><\/p>\n<p style=\"padding-left: 30px;\"><span style=\"color: #000080;\">Because the number of books scanned in the data set varies from year to year, to obtain frequencies for performing the analysis we normalized the yearly amount of occurrences using the occurrences, for each year, of the word \u201cthe\u201d, which is considered as a reliable indicator of the total number of words in the data set. We preferred to normalize by the word \u201cthe\u201d, rather than by the total number of words, to avoid the effect of the influx of data, special characters, etc. that may have come into books recently. 
The word \u201cthe\u201d is about 5\u20136% of all words, and a good representative of real writing, and real sentences.\u00a0<\/span><\/p>\n<p>This matters, because the overall frequency of \"the\" is far from constant &#8212; here it is for Google Books'\u00a0American English, British English, and English Fiction 1gram lists over the course of the 20th century:<\/p>\n<p><a href=\"http:\/\/languagelog.ldc.upenn.edu\/myl\/AmericanBritishLoveHate1.png\"><img decoding=\"async\" title=\"Click to embiggen\" src=\"http:\/\/languagelog.ldc.upenn.edu\/myl\/AmericanBritishLoveHate1.png\" alt=\"\" width=\"490\" \/><\/a><\/p>\n<p>Acerbi et al. suggest\u00a0these significant differences among countries, genres, and times simply reflect\u00a0\"the effect of the influx of data, special characters, etc. that may have come into books recently\" &#8212; and therefore normalizing by <em>the<\/em> counts gives a better picture of word frequency than normalizing by overall token counts. But in fact there's good reason to attribute\u00a0a significant fraction of the differences in the\u00a0frequency of <em>the<\/em>\u00a0to\u00a0real stylistic variation in the language, not just variation in the amount of \"data, special characters, etc.\" in the Google Book sample.<\/p>\n<p>One piece of evidence is the fact that a similarly declining\u00a0pattern can be seen in State of the Union addresses, as discussed in \"<a href=\"http:\/\/languagelog.ldc.upenn.edu\/nll\/?p=9998\" target=\"_blank\">SOTU evolution<\/a>\" (1\/26\/2014) and \"<a href=\"http:\/\/languagelog.ldc.upenn.edu\/nll\/?p=16938\" target=\"_blank\">Decreasing Definiteness<\/a>\" (1\/8\/2015):<\/p>\n<p><a href=\"http:\/\/languagelog.ldc.upenn.edu\/myl\/TheSOTU1.png\"><img decoding=\"async\" title=\"Click to embiggen\" src=\"http:\/\/languagelog.ldc.upenn.edu\/myl\/TheSOTU1.png\" alt=\"\" width=\"490\" \/><\/a><\/p>\n<p>The same trend can be seen in data from the Corpus of Historical American English:<\/p>\n<p><a 
href=\"http:\/\/languagelog.ldc.upenn.edu\/myl\/COHA_THE1.png\"><img decoding=\"async\" title=\"Click to embiggen\" src=\"http:\/\/languagelog.ldc.upenn.edu\/myl\/COHA_THE1.png\" alt=\"\" width=\"490\" \/><\/a><\/p>\n<p>And as noted in \"<a href=\"http:\/\/languagelog.ldc.upenn.edu\/nll\/?p=22893\" target=\"_blank\">Positivity<\/a>\" (12\/21\/2015), something similar has been happening in MEDLINE text:<\/p>\n<p><a href=\"http:\/\/languagelog.ldc.upenn.edu\/myl\/MedlineB9.png\"><img decoding=\"async\" title=\"Click to embiggen\" src=\"http:\/\/languagelog.ldc.upenn.edu\/myl\/MedlineB9.png\" alt=\"\" width=\"490\" \/><\/a><\/p>\n<p>In \"<a href=\"http:\/\/languagelog.ldc.upenn.edu\/nll\/?p=16991\" target=\"_blank\">Why definiteness is decreasing, part 1<\/a>\" (1\/9\/2015), I presented some evidence that this is due to a secular trend in the direction of greater informality in the written language. In \"<a href=\"http:\/\/languagelog.ldc.upenn.edu\/nll\/?p=17006\" target=\"_blank\">Why definiteness is decreasing, part 2<\/a>\" (1\/10\/2015), I presented some evidence, based on age-grading, that\u00a0a similar change is taking place in (American) conversational speech.\u00a0And in \"<a href=\"http:\/\/languagelog.ldc.upenn.edu\/nll\/?p=17215\" target=\"_blank\">Why definiteness is decreasing, part 3<\/a>\" (1\/18\/2015), I tried to evaluate a suggestion (due to Jamie Pennebaker) that some part of the change might be caused by an increase in the frequency of 's-genitives relative to\u00a0of-genitives.<\/p>\n<p>Whatever the causes of decreasing definiteness, it's clearly a real change in the language, not just a change in the publishing industry or in Google Books' sampling results. 
And so if you normalize\u00a0the yearly counts of other words by the yearly counts of <em>the<\/em>, you're studying the evolution of the definite determiner\u00a0(and formality, and &#8230;) as well as whatever other culturomic trends you're trying to trace.<\/p>\n<p>How much difference does it make? Well, I suspect that the\u00a0claimed trans-Atlantic emotion gap (\"American English has become decidedly more 'emotional' than British English\") is (at least) exaggerated by the observed trans-Atlantic definiteness gap. I don't have time this morning to replicate the whole Acerbi et al. study, but here's a plot for (case-insensitive) forms of HATE (i.e.\u00a0<em>hate|hating|hates|hated|hater|haters<\/em>), which shows exactly the predicted exaggeration of trans-Atlantic trends:<\/p>\n<p><a href=\"http:\/\/languagelog.ldc.upenn.edu\/myl\/AmericanBritishLoveHate2.png\"><img decoding=\"async\" title=\"Click to embiggen\" src=\"http:\/\/languagelog.ldc.upenn.edu\/myl\/AmericanBritishLoveHate2.png\" alt=\"\" width=\"490\" \/><\/a><\/p>\n<p>And I'd guess, in advance of investigation, that much of the post-1980 HATE boom in the U.S. 
is due to factors like <a href=\"https:\/\/books.google.com\/ngrams\/interactive_chart?content=hate+crime%2Chate+speech&amp;case_insensitive=on&amp;year_start=1960&amp;year_end=2000&amp;corpus=17&amp;smoothing=3&amp;share=&amp;direct_url=t4%3B%2Chate%20crime%3B%2Cc0%3B%2Cs0%3B%3Bhate%20crime%3B%2Cc0%3B%3BHate%20Crime%3B%2Cc0%3B%3BHate%20crime%3B%2Cc0%3B%3BHATE%20CRIME%3B%2Cc0%3B.t4%3B%2Chate%20speech%3B%2Cc0%3B%2Cs0%3B%3Bhate%20speech%3B%2Cc0%3B%3BHate%20Speech%3B%2Cc0%3B%3BHate%20speech%3B%2Cc0%3B%3BHATE%20SPEECH%3B%2Cc0\" target=\"_blank\">the rise of terms such as\u00a0<em>hate speech<\/em> and <em>hate crime<\/em><\/a>, as well as a\u00a0general bleaching of the word HATE towards mere disapproval, as in phrases like <a href=\"https:\/\/books.google.com\/ngrams\/interactive_chart?content=I+hate+to+say+it%2CI+hate+to+tell+you&amp;year_start=1900&amp;year_end=2000&amp;corpus=15&amp;smoothing=3&amp;share=&amp;direct_url=t1%3B%2CI%20hate%20to%20say%20it%3B%2Cc0%3B.t1%3B%2CI%20hate%20to%20tell%20you%3B%2Cc0\" target=\"_blank\">\"I hate to say it\" or \"I hate to tell you\"<\/a>.<\/p>\n<p>So maybe the title of Philip Ball's <em>Nature News<\/em> article \"<a href=\"http:\/\/www.nature.com\/news\/text-mining-uncovers-british-reserve-and-us-emotion-1.12642\" target=\"_blank\">Text mining uncovers British reserve and US emotion<\/a>\" (3\/21\/2013) should have been \"Text mining uncovers British formality and US informality\".<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Alberto Acerbi, Vasileios Lampos, Philip Garnett, &amp; R. 
Alexander Bentley, \"The Expression of Emotions in 20th Century Books\", PLOSOne 3\/20\/2013: We report here trends in the usage of \u201cmood\u201d words, that is, words carrying emotional content, in 20th century English language books, using the data set provided by Google that includes word frequencies in [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[60],"tags":[],"class_list":["post-23210","post","type-post","status-publish","format-standard","hentry","category-computational-linguistics"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=\/wp\/v2\/posts\/23210","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=23210"}],"version-history":[{"count":14,"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=\/wp\/v2\/posts\/23210\/revisions"}],"predecessor-version":[{"id":23276,"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=\/wp\/v2\/posts\/23210\/re
visions\/23276"}],"wp:attachment":[{"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=23210"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=23210"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=23210"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}