Language Log

Finding non-Roman letters and characters in an MS Word document

December 10, 2016 @ 3:07 pm · Filed by Victor Mair under Language and computers, Typography

Somebody asked Mark Swofford to help her devise a speedy, easy way to locate all the Chinese characters in a book-length manuscript that she was working on. Mark set to work on the problem, and this is what he came up with:

"How to find Chinese characters in an MS Word document" (12/10/16)

It just so happens that the manuscript in question was for a book that I am editing. My publisher needed ALL the Asian characters converted to a font called SimSun and also needed everything done on a PC instead of a Mac (which my colleagues and I use) so as not to cause problems interfacing with its printing operation. Even the IT people at Penn could not come up with a workable solution for making the conversion. So my editorial assistant faced the prospect of sitting in front of a computer for days and going through the large manuscript manually finding and changing each character. Instead of putting her through that tedium and torture, I wrote to Mark and asked him if he could figure out a solution, since he had helped us with many other programming challenges in the past.

Mark is in Taiwan, so he works while we sleep. We sent the request to him one evening a few weeks ago and, voila! The next morning when we woke up, Mark had it all figured out, saving us an enormous amount of boredom and stress.

This method can also easily be changed to fit the Greek alphabet, or Cyrillic, etc.
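
For readers who want to try the same idea outside Word, here is a minimal Python sketch of a character-range search. It is not Mark's method (his post has the Word-specific steps), and the code point ranges are assumptions chosen for illustration: the main CJK Unified Ideographs block plus Extension A for Han, and the basic Greek and Cyrillic blocks. The names `HAN`, `GREEK`, `CYRILLIC`, and `find_runs` are hypothetical.

```python
import re

# Assumed ranges, for illustration only. A real manuscript may need more,
# e.g. CJK Extension B and later, which lie outside the BMP.
HAN = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]+")
GREEK = re.compile(r"[\u0370-\u03FF]+")
CYRILLIC = re.compile(r"[\u0400-\u04FF]+")


def find_runs(pattern, text):
    """Return (start offset, matched run) for every run of characters in the range."""
    return [(m.start(), m.group()) for m in pattern.finditer(text)]


sample = "The word 漢字 (kanji), the Greek letter πι, and Russian слово."

for label, pattern in (("Han", HAN), ("Greek", GREEK), ("Cyrillic", CYRILLIC)):
    print(label, find_runs(pattern, sample))
```

Switching scripts only means swapping the range inside the brackets, which is why the approach carries over so easily to Greek or Cyrillic.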

9 Comments

January First-of-May said,

December 10, 2016 @ 7:27 pm

It's a rare day when I can legitimately say that the OTT (xkcd Time thread) did it first (or at least earlier):

"And yes, indeed, all of the Cherokee in the Thread (thus far) is ultimately the result of that post. Using Emacs' isearch-forward-regexp command <…> and searching for the pattern /[Ꭰ-Ᏼ]/ (that's Cherokee "A" through Cherokee "YV"), I readily verified that Cherokee characters have thus far only appeared in…"
(mrob27 [i.e. Robert Munafo], OTT:1491:18 [i.e. reply 59618], September 22, 2013)

Of course Emacs and MS Word are not especially similar, but really the only thing Mark Swofford did that Robert Munafo didn't is find an equivalent of that Emacs command for MS Word. (Which might well have been the hardest part, so I applaud him for it anyway.) It's a pretty obvious solution really (actually I'm half sure that I've heard of it being used in some other context too).

quink said,

December 10, 2016 @ 8:43 pm

It's a character range search, so it won't find other places in Unicode where Han Unified Characters are found – specifically the newer extra ranges outside the BMP.

It also includes non-Chinese characters: the example mentions Katakana. Plus, the groupings in the character map don't correspond cleanly to code point ranges.

The only sane way is to use Unicode Character Classes available in better regular expression engines.

So, step 1 – don't use Word. Step 2 – Make a regex that needs to include little more than just \p{Han}.

quink said,

December 10, 2016 @ 10:35 pm

http://www.regular-expressions.info/unicode.html for more info on that.
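
As a rough illustration of the difference between a range search and a property-based one (an editorial sketch, not part of the comment above): Python's built-in `re` module does not accept `\p{Han}`, so the example below approximates the Han script property by checking Unicode character names with the standard `unicodedata` module. This is an assumption-laden stand-in; it misses some characters that `\p{Han}` covers, such as radicals and the iteration mark 々. The helper names `is_han` and `han_chars` are hypothetical.

```python
import unicodedata


def is_han(ch):
    """Approximate the Han script property by looking at the character's Unicode name."""
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH"))


def han_chars(text):
    """Return the characters in text that is_han() accepts, in order."""
    return [ch for ch in text if is_han(ch)]


# U+20000 is an ideograph from CJK Extension B, outside the BMP.
sample = "漢字 カタカナ \U00020000 abc"
print(han_chars(sample))
```

Unlike a single BMP range, this check accepts the Extension B character and rejects the Katakana. Regex engines with real Unicode property support make the helper unnecessary; the third-party `regex` package for Python is one that accepts `\p{Han}` in patterns.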

Jonathan Smith said,

December 11, 2016 @ 1:18 am

I do this kind of thing routinely, but with more specific ranges; for the last few years from 㐀 to 﨩 has been about right. But this is for more specific stuff like changing font size selectively. For the specific problem described, for heaven's sake simply select all text > change to SimSun (or whatever) > change back to TimesNewRoman (or whatever).

tangent said,

December 11, 2016 @ 3:55 am

In "singular 'they'" usage discussion, the first sentence of this post I would probably have written myself as:

"Somebody asked Mark Swofford to help them devise a speedy, easy way to locate all the Chinese characters in a book-length manuscript that they were working on."

Rubrick said,

December 11, 2016 @ 5:17 am

I had a few spare minutes, so I wrote a quick regex that replaces all occurrences of Windows-based publishing establishments with sensible alternatives.

James Wimberley said,

December 11, 2016 @ 6:18 am

Memo to the sysop of the buggy beta simulation we are in: please take up Rubrick's suggestion, for politicians. You can set the bar for "sensible alternatives" pretty low.

Keith said,

December 12, 2016 @ 3:33 am

I am ever so slightly disappointed to read that "the IT people at Penn could not come up with a workable solution for making the conversion". I can only hope that this is because a very competent team of IT professionals deal with things like networks, print and mail servers, and other "infrastructure" tasks, and were not recruited for desktop office application support.

Because, as others have pointed out, this is really nothing more complicated than searching for characters that appear within a certain range. It is the kind of thing that many, many people run into. Even system and network administrators encounter these situations, but I suspect that those at Penn, not being MS Word specialists, just threw up their hands and said "I can't think how to do that".

But on the other hand, it's great that having figured out a way to do it, the method is now published for everybody to learn from; that's what T B-L invented the Intarwebs for, right? Not for publishing photos of cute kittens…

Smith said,

December 14, 2016 @ 10:38 am

I'm with Jonathan (other) Smith here. Unless the problem is more complicated than was described in the post, if it is simply a matter of changing the font of the ZH text while utilizing a different font for the Roman characters, indeed: select all, assign ZH font to entire document, then assign whatever Western font you desire to the entire document. All the ZH text will be in the assigned ZH font (in this case, the detestable SimSun) while the Roman text will be in the chosen Western font. Takes seconds.
