Hyphenation with words containing capital letters
A truly startling (and surely unintended) hyphenation in the print edition of The Economist (March 11th) suggests that some updating of word-breaking algorithms is in order in the light of the fairly recent practice of inventing product and brand names that have word-internal upper-case letters. An article about juvenile delinquency, reporting that kids are less involved in crime in part because they're indoors playing video games, ends with this paragraph (I reproduce the line breaks and hyphens of the UK print edition exactly, though not the microspacing that justifies the right-hand margin; the only thing I'm interested in is the end of the penultimate line):
The decline in crime among the young
bodes well for the future. A Home Office
study in 2013 found that those who com-
mitted their first crime aged between ten
and 17 were nearly four times more likely to
become chronic offenders than those who
were aged 18-24, and 11 times more likely
than those who were over 25. More PlayS-
tation, less police station.
Even for a word like workstation, it would be very odd to hyphenate it as works- (line break) tation, but I guess the algorithms that decide where to hyphenate in narrow-column typesetting do not have full details of where the stem boundaries fall in English compound words (treetop, daylight, workstation, lunchroom, teacup, cutthroat, typesetting, update, and hundreds of thousands of others).
I know very little about hyphenation algorithms (comments below are open so that truly nerdy readers who know about word-processing and typesetting can enlighten me), but my guess is that breaking in a way that leaves a possible syllable each side of the break is favored over breaking either before or after a cluster of consonant letters. Thus
There needs to be a rule in there that says in effect, "Avoid at all costs a hyphen immediately after a word-internal capital letter" — and perhaps the program should also favor a hyphen break before any such word-internal capital.
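A minimal sketch of the rule proposed above, assuming a hyphenator that has already produced a list of candidate break positions. The function names and the position convention (a break at `i` splits the word into `word[:i]` and `word[i:]`) are hypothetical and not taken from any real typesetting system.

```python
def drop_breaks_after_internal_capital(word, candidates):
    """Remove candidate breaks that fall immediately after a word-internal capital.

    A candidate i splits the word as word[:i] + "-" (line break) + word[i:].
    """
    allowed = []
    for i in candidates:
        # word[i - 1] is the last letter before the hyphen; "internal" means
        # it is not the first letter of the word.
        after_internal_capital = i - 1 > 0 and word[i - 1].isupper()
        if not after_internal_capital:
            allowed.append(i)
    return allowed


def breaks_before_internal_capital(word):
    """Positions just before a capital that follows a lower-case letter."""
    return [
        i
        for i in range(1, len(word))
        if word[i].isupper() and word[i - 1].islower()
    ]


# "PlayStation": 4 = Play|Station, 5 = PlayS|tation, 7 = PlaySta|tion
print(drop_breaks_after_internal_capital("PlayStation", [4, 5, 7]))
print(breaks_before_internal_capital("PlayStation"))
```

One side effect of the first function as written: in an all-capitals word every break after the first letter follows a capital, so every candidate is rejected. A real implementation would have to decide whether that is the desired behaviour.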
Laura Morland said,
March 13, 2017 @ 3:51 am
100% in agreement with your proposed rule.
Speaking of rules, thanks for breaking your own against opening up your posts to comments. When I read the post on my email just now, I sighed to myself, saying, "There's a typo, but Pullum never* allows comments, so I can't let him know."
Mais, voilà : "of course I canNOT reproduce the microspacing that they use to justify the right-hand margin;"
_____________
* the "never" is simply my experience; I'm sure to have missed other exceptional instances such as this one
James Wimberley said,
March 13, 2017 @ 5:18 am
The Economist's algorithm would presumably allow the self-referential "clusterf* [ line break ] *k".
David Marjanović said,
March 13, 2017 @ 5:52 am
Do you mean it would be very odd to hyphenate it as works- tation?
Dick Margulis said,
March 13, 2017 @ 6:26 am
Hyphenation algorithms work with a main dictionary, a custom dictionary, and rules for everything not listed. The rules work for most words but not for all, as your example shows, and so the user is typically reminded to add the word to the custom dictionary. Such algorithms do not analyze syntax to determine whether one is using the noun prog-ress or the verb pro-gress.
A further complication is that dictionaries disagree on word division (one example: En-glish in M-W; Eng-lish in AHD), and if the embedded dictionary the software defaults to is different from the publisher's preferred dictionary, a sharp proofreader will catch the discrepancies.
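The lookup order described in this comment can be sketched roughly as follows. Everything here is an assumption made for illustration: the dictionary contents, the choice to let the custom dictionary override the main one, and the deliberately naive fallback rule (break between two adjacent consonant letters), which stands in for whatever rules a real system uses.

```python
# Hypothetical entries: word -> break positions (i splits word[:i] | word[i:]).
MAIN_DICTIONARY = {"english": [2], "progress": [4]}   # En-glish, prog-ress
CUSTOM_DICTIONARY = {"playstation": [4]}               # Play-Station

VOWELS = set("aeiouy")


def rule_based_breaks(word):
    """Naive placeholder rule: break between two adjacent consonant letters,
    keeping at least two letters on each side of the break."""
    lower = word.lower()
    return [
        i
        for i in range(2, len(lower) - 1)
        if lower[i - 1] not in VOWELS and lower[i] not in VOWELS
    ]


def show(word, breaks):
    """Render a word with hyphens at the given positions."""
    pieces, start = [], 0
    for i in breaks:
        pieces.append(word[start:i])
        start = i
    pieces.append(word[start:])
    return "-".join(pieces)


def hyphenate(word):
    """Custom dictionary first, then main dictionary, then the rules."""
    key = word.lower()
    for table in (CUSTOM_DICTIONARY, MAIN_DICTIONARY):
        if key in table:
            return show(word, table[key])
    return show(word, rule_based_breaks(word))


print(hyphenate("PlayStation"))  # found in the custom dictionary
print(hyphenate("English"))      # found in the main dictionary
print(hyphenate("cluster"))      # falls through to the rule
```

Note that nothing in this lookup looks at the surrounding sentence, which is why such a scheme cannot tell the noun prog-ress from the verb pro-gress.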
Ray said,
March 13, 2017 @ 6:35 am
(another stylistic ouchie: apparently the economist spells out numbers 10 or less, even in a sentence where there's such a mix of numbers above and below ten that are being compared.)
reader_not_acedeme said,
March 13, 2017 @ 6:55 am
Oh, I can't help but smile at the "difficulties" of English hyphenation. Here are two uncalled-for factoids from the Hungarian rulebook.
1) As a general rule, you hyphenate to keep syllables intact. Except in cases where you have clear internal constituents, such as a verbal particle: then you keep the integrity of these. So you have "megint" as an adjective with no internal structure meaning 'again', which you hyphenate as me-gint. And you have it as particle+verb meaning 'admonish', which you hyphenate as meg-int.
2) Not the same kind of hyphenation, but: compound nouns. Like German, Hungarian's got long 'uns. If they exceed 6 syllables and 3 constituents, you insert a hyphen in the "logical" place. But if it's 6+ syllables but only 2 constituents, no hyphens. If you add more constituents, the "logical" place changes and the hyphen starts wandering around.
In short, for hyphenation, you need a pretty darn good morphological analyzer, plus some algorithmic logic on top that would be too difficult to express in any formalism for morphology, so it's just hand-coded.
Eleanor said,
March 13, 2017 @ 7:14 am
The Oxford Quick Reference Spelling Dictionary, which includes word breaks for line-end hyphenations, has 'work|station' as the break point, with a secondary option of 'worksta|tion' if the first isn't possible. (I'm looking at the 1998 edition; I'd be surprised if it's changed since.)
spellchick said,
March 13, 2017 @ 7:48 am
They must be using the same software as the Washington Post print edition, where I regularly encounter exactly this type of nonsensical hyphenation. It is not limited to words with an internal capital letter, and while the error is more common with brand names and the like, I've also seen it in perfectly good dictionary words. I believe it is always a "one-off" error. It is quite distracting.
Kyle Gorman said,
March 13, 2017 @ 7:52 am
@Geoff: not all speakers even syllabify "cluster" the way you apparently do: a significant minority appear to prefer clu-ster to clus-ter.
flow said,
March 13, 2017 @ 7:53 am
This is SO Microsoft Word (1998 version) hyphenation rules for German, the difference being that back in the day, Word would drive me nuts with its insistence on pulling the Fugen-S (https://en.wikipedia.org/wiki/German_nouns#Compounds, https://de.wikipedia.org/wiki/Fugenlaut) to the next line even if that resulted in an impossible consonant cluster, such as in Bahnhof|svorplatz or Antritt|srede.
A quick check whether 'svo-' or 'vo-' is the more likely start of a German orthographic syllable would've done the trick (for many words, but not all, e.g. Arbeits|amt vs ?Arbeit|samt). Likewise, what algorithm makes a program decide that PlayS|… and |tation are more likely to occur in an English text than Play|… and |Station? One would want to imagine this to be the glitchy, unforeseen side-effect of some other, helpful and valid yardstick to find good hyphenation opportunities, but what could that be? Was maybe lettercase discarded early on, and the algo decided that 'plays' makes a perfectly cromulent verb?
@Ray—"(another stylistic ouchie: apparently the economist spells out numbers 10 or less, even in a sentence where there's such a mix of numbers above and below ten that are being compared.)"—this. Occurs between ten and 20 times to me, daily.
Johan P said,
March 13, 2017 @ 8:35 am
Why would it be odd to hyphenate workstation as work-station? It's both the clearest for the reader and follows the construction of the word exactly.
In fact, moving from a system which is based around syllables to one which is based around morphemes is not unreasonable, and will often be the most easy to read once people are used to it. I know the language recommendations for Swedish were recently changed to follow that pattern.
PS. No matter what the algorithm does, which is very often wrong, it's part of the page editors' and proof readers' job to fix this kind of obvious error. I think we had about two pages on various hyphenation rules in the manual I worked from when I edited pages for a newspaper.
John Roth said,
March 13, 2017 @ 9:33 am
To expand slightly on what Dick Margulis said: in extremis, the proofreader can insert a soft hyphen into the word; this ought to override all dictionaries and rules.
Gabe Burns said,
March 13, 2017 @ 9:42 am
@Johan P I suspect, as did David Marjanovic (apologies for the missing diacritic) above, that Prof. Pullum meant to say that it would never be hyphenated as "works-(linebreak)tation", given that the article is about The Economist's choice to hyphenate PlayStation as "PlayS-(linebreak)tation". "Work-(linebreak)station" (or "Play-(linebreak)Station"), on the other hand, seems like the most reasonable hyphenation choice.
Dick Margulis said,
March 13, 2017 @ 10:04 am
A few notes:
1. @Kyle: The topic is hyphenation software, which depends on how a given dictionary syllabifies words or how internal rules are expressed, not how speakers syllabify words (as much as that might be the ideal).
2. @spellchick: Newspapers are a special case. There are four conflicting constraints: a narrow column; justified type; a maximum allowed word space (an aesthetic constraint to avoid pigeonholes); and correct syllabification. Depending on the sophistication of the hyphenation and justification (H&J) algorithm used by the newspaper, the thing that tends to break first is correct syllabification. So you see breaks in one-syllable words (stren-gth, for example) pretty frequently.
3. @John Roth: Proofreader? What proofreader? In books, yes. In newspapers? Not any more. The copy editor, if there still is one, is the last person to see the copy before it's printed. In less frequent periodicals, there may or may not still be a budget for proofreaders. Check the masthead to be sure.
Picky said,
March 13, 2017 @ 10:04 am
As others have said here, the sub-editor (copy editor) concerned should override unsuitable system hyphenations during the editing process. The subs' desk should also inform system management of additions needed to the hyphenation dictionary. That's what they did when I was involved in the business more than a decade ago, but on the other hand that was before it was discovered that money wasted employing sub-editors could be better spent rewarding the hard work of shareholders.
DWalker07 said,
March 13, 2017 @ 10:19 am
Yes, I'm also confused about this:
"Even for a word like workstation, it would be very odd to hyphenate it as work- (line break) station,"
Why would that be odd?
Dick Margulis said,
March 13, 2017 @ 10:24 am
@Picky: I'm not aware of any opportunity for the copy editor (who edits copy before it is composed on the page) to see the composed copy before it is printed. That used to be the case, back in the day of the slot and the rim and hot metal galleys. But I don't think it's the case anymore.
Johan P said,
March 13, 2017 @ 10:38 am
In many modern newspapers and magazines, the roles of copy editor and page designer are conflated. Proofreader too, of course.
Lane said,
March 13, 2017 @ 10:46 am
What Picky said. (I'm at The Economist.) Our software hyphenates many words automatically, but often puts the hyphen in an oddball place. We're supposed to override that (as when the software might make overr-ide, because "-ide" is a common scientific suffix) when we spot it. This is a howler that didn't get spotted.
One correction: we don't have subeditors as such; only section editors and proofreaders. (And that's always been the case.) We haven't cut editors or any journalistic staff; in fact, we've slightly expanded the numbers.
Fun fact, we fix such things with what's called a "dishy". After about a year, I learned where the odd name came from: it's a "DIScretionary HYphen".
Robert Ayers said,
March 13, 2017 @ 11:11 am
The Economist of 25 February contains this hyphenation:
toysh-
ops
It looks like compound words are difficult over there.
BZ said,
March 13, 2017 @ 11:17 am
When I was in school I was taught that when a line break is needed it should be put in between two consonants if possible. Of course in case of PlayStation, there are three consonants in a row, but maybe the "y" is less preferable because it's not always a consonant?
Rodger C said,
March 13, 2017 @ 11:41 am
Ca. 1982, I was unemployed and an older friend whose mother had been a newspaper columnist suggested I apply for a job as a proofreader. I had to tell him there was no such thing any more–only what were then exceedingly primitive algorithms that filled the papers with off-ended lan-downers.
Gregory Kusnick said,
March 13, 2017 @ 2:26 pm
Just guessing, but perhaps the algorithm chose to hyphenate PlayS-tation after the pattern of ges-tation or infes-tation rather than work-station.
Idran said,
March 13, 2017 @ 4:40 pm
@BZ: "Y" isn't a consonant in "Playstation", it's a vowel.
John said,
March 13, 2017 @ 10:24 pm
One thing, it seems, that hasn't been considered is avoiding confusion of the name "PlayStation" itself. In other words, "PlayS-[line break]tation," to me, refers unambiguously to the unhyphenated "PlayStation." The more logical hyphenation in "Play-[line break]Station" leaves the name ambiguous as to whether it's "PlayStation" or "Play-Station."
Long URLs present this problem, too: A long URL with a hyphen and a line break is ambiguous as to whether the hyphen is a part of the URL; a URL with a [dot] immediately before a line break misleads the reader to think the sentence is over.
I agree that, in this case, the result isn't particularly pretty, but it certainly avoids ambiguity of "PlayStation" versus "Play-Station."
Vic said,
March 14, 2017 @ 1:31 am
It's been about 30 years since I last worked on newspaper hyphenation software, but here's what I remember:
1) The first system I worked on had a hyphenation algorithm, which we ran against a version of Webster's New World Dictionary that the company had on magnetic tape, and all the words which didn't hyphenate by algorithm were entered into an exception dictionary.
2) When I was tasked with rewriting the software for Finnish, the people at the newspaper laughed at me when I asked for a dictionary to use to generate the exception dictionary – they politely informed me that the algorithm they had provided properly hyphenated all Finnish words. For special cases (e.g., foreign words), they would use a discretionary hyphen to show where to break the word if necessary. But usually foreign words (I remember in particular Ronald Reagan's name) were written as though they were Finnish as modified with noun declension.
3) There were some differences between what was considered a proper hyphenation point versus syllable breaks. Mostly this involved having a minimum number of letters at the end and beginning of a line, but there were some other exceptions; it was more than once explained to me by a typographer, but I don't remember the lesson.
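Points 1) and 3) above can be sketched together. This is only an illustration under loud assumptions: the reference word list, the toy algorithm (break between two adjacent consonant letters), and the minimum letter counts are all invented here and have nothing to do with the actual system or dictionary described in the comment.

```python
VOWELS = set("aeiou")

# Hypothetical reference dictionary: word -> correct break positions.
REFERENCE = {
    "cluster": [4],         # clus-ter
    "minister": [2, 5],     # mi-nis-ter
    "workstation": [4, 7],  # work-sta-tion
}


def toy_algorithm(word):
    """Stand-in hyphenation rule: break between two adjacent consonants."""
    return [
        i
        for i in range(2, len(word) - 1)
        if word[i - 1] not in VOWELS and word[i] not in VOWELS
    ]


def build_exception_dictionary(reference, algorithm):
    """Keep only the words the algorithm gets wrong, with their correct breaks."""
    return {
        word: breaks
        for word, breaks in reference.items()
        if algorithm(word) != breaks
    }


def enforce_margins(word, breaks, min_left=2, min_right=3):
    """Drop breaks that would leave too few letters before or after the hyphen."""
    return [i for i in breaks if i >= min_left and len(word) - i >= min_right]


exceptions = build_exception_dictionary(REFERENCE, toy_algorithm)
print(exceptions)
print(enforce_margins("minister", [2, 5], min_left=3))
```

With this toy rule, "cluster" comes out right and stays out of the exception dictionary, while "minister" and "workstation" have to be listed.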
Adam F said,
March 14, 2017 @ 4:43 am
@Dick Margulis
I went looking to see what "pigeonholes" are in typography (pretty much what I'd guessed) and the first sensible hit I found was yours!
Typesetting myths you should have gotten over by now
RP said,
March 14, 2017 @ 5:39 am
@Kyle,
"@Geoff: not all speakers even syllabify 'cluster' the way you apparently do: a significant minority appear to prefer clu-ster to clus-ter."
This doesn't necessarily mean clu-ster would be a good way to hyphenate. Ideally, the reader seeing the first half of the word in isolation shouldn't be misled as to its pronunciation.
jaap said,
March 14, 2017 @ 8:00 am
The classic hyphenation error in Dutch is the word "minister". Instead of breaking at either of the syllable breaks in mi-nis-ter, it is all too often hyphenated as mini-ster (which reads as if it means small star). You'd think the newspapers would know this by now, but they still often get it wrong.
Another classic is "bommelding". This is a compound noun (bom-melding = bomb notification) but is easily mispronounced and nonsensical if hyphenated in the syllable break of the second part (bommel-ding).
Susan said,
March 14, 2017 @ 8:24 am
@Idran: I am a native speaker of English. In "Play", the "y" is actually a consonant, in both written and spoken English. In spoken English, it is technically the consonantal second half of a semi-vowel, even though linguists claim the pattern here is CV. Imagine "yes" (CVC), where it is clear to see the "y" acting as a word-initial consonant and "e" acting as a vowel.
spellchick said,
March 14, 2017 @ 9:39 am
My question (ok, rant) for those who have worked on this issue is not "why aren't the results perfect?" but rather "why are current systems producing results noticeably inferior to publications from 10-15 years ago?"
RP said,
March 14, 2017 @ 5:31 pm
@Susan,
It is certainly true that most English speakers consider the letter "y" to be a consonant. There is a case for arguing that "play" ends with a consonant in written English. But what is the basis for your claim that, in the word "play", the "y" is a consonant in "spoken English"? I would say that in spoken English, the word "play" ends in a diphthong consisting of the two vowels /eɪ/. The letters "ay" represent that diphthong. I don't generally hear a consonantal /j/ at the end of "play". Would you argue that "café", "sensei" and "anime" end in consonants in "spoken English"? It seems to me that they all end with exactly the same /eɪ/ sound that "play" does. (Some or all of them may end with /e/ on its own or a different sound in their original languages, but not in most people's spoken English.)
Rubrick said,
March 15, 2017 @ 3:50 pm
BTW, Prof. Pullum has confirmed that he 'intended to write "works-tation" but it was too awful an error for my fingers to type'. (And he's fixed it in the post.)
Mark S said,
March 17, 2017 @ 8:12 am
@Dick Margulis:
"Such algorithms do not analyze syntax to determine whether one is using the noun prog-ress or the verb pro-gress"
I'm no expert, but I wouldn't have expected the part of speech of "progress" to affect how it would be hyphenated, only its pronunciation; but I'm from the UK, so I say "PRO-gress" for the noun, and "pruhGRESS" for the verb.
Dick Margulis said,
March 17, 2017 @ 10:33 am
@Mark S: Reiterating, we're talking about word division as dictated by the publisher's preferred dictionary and enforced by the publisher's proofreaders, not syllabification based on introspection by a native speaker. This is a question about how H&J algorithms work rather than about actual linguistic concerns. And if you look in an AmE dictionary of repute, you'll see the distinction made with pro-gress|prog-ress that I identified. BrE dictionaries may not make that distinction. A quick glance at one online Oxford dictionary doesn't show syllabification at all, so I can't really judge.
spellchick said,
March 19, 2017 @ 8:57 am
An excellent specimen from today's print edition of the Washington Post:
"low-energy" required a line break, and appeared as "lo[linebreak]w-energy".
ASM said,
March 19, 2017 @ 9:15 am
At about the time I first used a word processor, I vaguely remember that Donald Knuth had something to say about hyphenation in "TEX and Metafont". A bit of Googling has brought me to the following thesis by one of his pupils, Frank Liang.
https://tug.org/docs/liang/liang-thesis.pdf
There you can see just how automatic hyphenation (in English) is a problem that is meaty enough to warrant a thesis.
That early word processor brought me into contact with a man who, as an employee of IBM in the 1970s, had attended international conferences on hyphenation. In his recollection hyphenation (in most languages other than English) was already amenable to computer processing at that time.
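For readers who don't want to open the thesis, here is a rough sketch of the pattern-matching idea in Liang's approach: patterns are letter strings with digits between the letters, every matching pattern is laid over the word, the highest digit wins at each gap, and an odd winner permits a break while an even one forbids it. The four patterns below are invented for illustration and are not taken from any real pattern file; the minimum letter counts are likewise just illustrative defaults.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'ys2t' into its letters and per-gap weights."""
    letters = "".join(ch for ch in pattern if not ch.isdigit())
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)  # weight for the gap before letters[pos]
        else:
            pos += 1
    return letters, weights


def liang_breaks(word, patterns, left_min=2, right_min=3):
    """Return break positions i (word[:i] | word[i:]) permitted by the patterns."""
    text = "." + word.lower() + "."   # dots mark the word boundaries
    scores = [0] * (len(text) + 1)    # scores[j] = weight of gap before text[j]
    for letters, weights in map(parse_pattern, patterns):
        start = text.find(letters)
        while start != -1:
            for k, weight in enumerate(weights):
                scores[start + k] = max(scores[start + k], weight)
            start = text.find(letters, start + 1)
    # The gap before word[i] is the gap before text[i + 1].
    return [
        i
        for i in range(left_min, len(word) - right_min + 1)
        if scores[i + 1] % 2 == 1
    ]


# Hypothetical patterns: 's1t' alone permits the PlayS|tation break;
# the higher-level 'ys2t' overrides it.
print(liang_breaks("PlayStation", ["s1t", "y1s", "1tio"]))
print(liang_breaks("PlayStation", ["s1t", "y1s", "1tio", "ys2t"]))
```

The sketch lower-cases the word before matching, so it cannot see the internal capital at all; any rule about capitals would have to be layered on top.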