本字、正字 and consistency in transcription

This keeps coming up with transcription work. The question is, when transcribing a person speaking their local dialect, what characters should you use? I provide the following definitions, which are up for debate:

本字 běnzì – The character that most accurately represents the word etymologically. In a way, it shows the cognates.
正字 zhèngzì – The “standard” character: the one that conveys the meaning of the intended word to a wider audience.

As a semi-hypothetical example, Dialect X has a word that means “high” or “tall”, read “huan”. It’s cognate with Mandarin 懸 xuán as any educated speaker will tell you. A speaker of Dialect X may write it as 懸, or they may just write 高. They wouldn’t say 高 gāo or a cognate of 高. But then they may assume the rest of the country which doesn’t speak their dialect might not know 懸 as having this meaning, since in Standard Mandarin 懸 means “to hang”. So if you can imagine, they’re still writing in their dialect, but they’ve changed the characters to make it just a little easier to read for a wider audience.

I like to think of it as being like יהוה‎ read “Adonai” in the synagogue, even though that’s not what it spells. Or better yet, think of a novel with a single Hakka-speaking character: you still need to know what he’s saying, but the author also wants to capture a feeling of his Hakka-ness.

In this case, 懸 would be 本字, while 高 is 正字. Another example would be Cantonese 唔, which is not cognate with Mandarin 不. 唔 is the right character, but it may be written 不 for the same reasons that 懸 may be written 高.

Then there’s one more kind of character. I don’t know what they’re really called. I call them 音字, but I’m probably the only one. This would be when a word is cognate with the Standard Mandarin character for the word, but for one reason or another, a different unrelated character is used. For example we may see 上海宁 written on ads in Shanghai, meant to mean “Shanghainese person” where 人 is replaced with 宁 níng. 宁 doesn’t otherwise have this meaning. Instead it’s used as a sort of phonetic representation, probably for the sake of non-Wu speakers. 人 and 宁 are pronounced exactly the same in Shanghainese: /ɲɪɲ/. Using 宁 essentially serves the purpose of in-group association. You know it’s Shanghainese, and if you’re Shanghainese yourself, maybe you can feel a little smug. It’s an orthographic shibboleth.

This use of 宁 is neither 本字 (it’s not the original character for “person”) nor 正字 (since it’s not Standard Mandarin, but used to distinguish from Mandarin). In transcription, this sort of “音字” character should obviously never be used in place of an available 本字, since it only detracts from the intended meaning of the utterance.

I brought up 唔 as a negative adverb in Cantonese in part because Cantonese has a much stronger written tradition than other dialect groups in China. But even the many characters used in Cantonese writing aren’t without problems. A quick search returned a BBS post entitled 广东话的“佢”的正字可能就是“伊” (“The 正字 of Cantonese 佢 may well be 伊”). To quote:

原来闽南语也是说“他”为“佢”，和粤语发音一样，但写出来是“伊”，“伊”在上古汉语指的是“他”，如果闽南语是写的对，那“伊”就是“佢”咯！

In Min one also says “he” as “佢 (/i/)”, the same as the Cantonese pronunciation, but they write “伊”. In classical Chinese, 伊 referred to “he”. If Min is correctly written then 伊 is 佢!

Note that the post uses the term 正字 to mean the thing that I’m calling 本字.

佢 is common in Cantonese, and 伊 in Min and Wu (although 佢/渠 appears in some dialects), though some sources transcribe Wu “him” as 其. There’s a whole article on Wikipedia for personal pronouns in Sinitic languages which goes into far more detail than I should here. Ultimately, which is it? Another area is negation, where in Wu we might see 不, 弗 or more commonly 勿. Jerry Norman considered the Wu term to be cognate with Mandarin 不, but you’ll rarely see that character used in written Wu, and various dictionaries of Wu dialects disagree on the relationship between /vəʔ/ and 不.

My own policy is to use 本字 whenever available, and to use 音字 (or whatever it’s actually called) when the word lacks a written character, usually because it is a borrowing from a non-Sinitic substrate language. For example, a She 畲 word that seeped into Wenzhou dialect wouldn’t have an “original character”, so something else would have to stand in.

For more on this, see Problems in Comparative Chinese Dialectology: The Classification of Miin and Hakka by Branner and, to a lesser extent, Written Taiwanese by Klöter.

And if anyone knows what to call what I’ve referred to as 音字, please do let me know in the comments.

  1. Karan says:

    Hmm, I’m not sure if those terms are consistently used everywhere to have those meanings, though you might define them as such.

    For example, “正字” specifically in Cantonese is often used to mean “the etymologically correct character”. So, for example, the word “to come” is etymologically “蒞” (lei⁶) according to the sources I’ve checked, but it’s almost without exception written as either “嚟” (lei⁴) or “來” (loi⁴), of which you might call the former an “音字” and the latter a “對應字”.

    Now, I wouldn’t propose that one write Cantonese texts using “正字” because that would make a simple sentence like “我係你嘅朋友嚟嘅” (“我是你的朋友呀”) into “我係你忌朋友蒞忌” which would be incomprehensible to the majority of Cantonese speakers.

    So, in the end, what you’d want to use are the “widely accepted 字s” which might fall into either 音字, 對應字 or 正字. However, the important point is that those are the characters that native speakers would deem “correct” for writing the modern form of the language. It’s akin to using “donut” instead of “doughnut”, i.e., a widely-accepted-yet-etymologically-lacking “variant” spelling.

    • Kellen says:

      You and I talked about that a little bit (last week, was it?). The definitions above are as they get used in Taiwan with Hakka (and maybe Min). Admittedly these aren’t universal definitions.

      I’m confused about 忌 in the sentence “我係你忌朋友蒞忌”. Is that a typo? It’s not otherwise a 同源詞 with any possessive marker I know.

  2. Karan says:

    It’s not a typo. It’s just an almost-never used 正字. The theory is that it started out as “之” (which I’m sure you’ll appreciate) and then due to phonetic changes, was written as “忌” and much later “嘅”. If you do a search for “嘅 忌 正字” you’ll find some sites/fora talking about this, such as this one: http://tieba.baidu.com/p/862684076

  3. Matt says:

    This is a great post.

    In response to Karan’s last comment—wouldn’t 之 then be what Karan calls 正字 and Kellen calls 本子, with 忌 just being an early example of whatever 嘅 is? So, assuming 嘅 really does have an etymological relationship with 之, the sentence could be written, “我係你之朋友蒞之”?

    • Karan says:

      Yes, that’s correct. I actually didn’t know that 之 was a candidate for being the 正字 for 嘅 until just now after doing more research. I don’t know if it really is the 正字 though, because this is just based on something somebody posted in a forum.

    • Kellen says:

      This was my thought as well.

  4. Matt says:

    That makes sense to me.

  5. Matt says:

    Also, I have to say that I prefer 本字 (sorry about the typo above) for the character which most accurately represents the word etymologically, because 正字 is not only confusing due to the fact that it seems to be used differently in discussions of Cantonese & Hakka, but can also mean “correctly-written character” or 楷书, among other things. I don’t have a proposal for a less-confusing term for the second category, though. And 本字 also means “original form of a graph” in palaeography, so I guess that could be confusing, too…

    • Kellen says:

      And really, for where this matters, we’re pretty set on 本字 anyway. The real issue is always knowing what those are.

      The reason this has consumed so much of my time these past few weeks comes down to usefulness in computational analysis of transcriptions. I’d need to have a good handle on 本字 linked to transcription if I’m going to be able to show a range of pronunciations for a single set of cognates. To do that, I need the computer to know they’re cognates.
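
A minimal sketch of the kind of lookup described here, assuming each transcribed token is stored with the character actually written, its 本字, and a pronunciation. The record layout and the function name `pronunciations_by_benzi` are hypothetical, and the sample data reuses only forms mentioned in the post (plus Mandarin 人 rén); it is not how any real corpus is stored.

```python
from collections import defaultdict

# Hypothetical records: (variety, character as written, 本字, pronunciation).
# The written character may differ from the 本字, as with 高 for 懸 or 宁 for 人.
RECORDS = [
    ("Mandarin", "懸", "懸", "xuán"),
    ("Dialect X", "高", "懸", "huan"),
    ("Mandarin", "人", "人", "rén"),
    ("Shanghainese", "宁", "人", "ɲɪɲ"),
]


def pronunciations_by_benzi(records):
    """Group pronunciations by 本字 so each cognate set lists one form per variety."""
    index = defaultdict(dict)
    for variety, _written, benzi, pronunciation in records:
        index[benzi][variety] = pronunciation
    return dict(index)


index = pronunciations_by_benzi(RECORDS)
for benzi, forms in index.items():
    print(benzi, forms)
```

Keying on the 本字 rather than on the written character is what lets 高 “huan” and 懸 xuán end up in the same set; the price is that someone has to decide the 本字 for every token first.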

      • Matt says:

        Wow, that would be very useful.

      • Karan says:

        Determination of cognates will be complicated, I imagine, and I suppose one would need to keep track of “cognates” even within the same language. For example, 蒞/嚟 and 無/冇 are both pairs of words that are cognates but now have different readings and senses. But because the first of each of these pairs is a cognate with, say, Mandarin, would the second also be a cognate? Or be separately classified as a second-degree cognate?

        So, in addition to the different pronunciations of a single character in the different Sinitic languages, which itself will probably be a decent bit of work, one might also want to see a two-dimensional graph connecting the dots between these various cognates, such that one might, for example, link the Cantonese 冇 to the Mandarin 無.
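
One way the “connect the dots” idea could be sketched, assuming cognate relations are stored as undirected links between (variety, character) pairs. The edge list is illustrative and uses only the pairs named in this thread; `build_graph` and `cognate_distance` are hypothetical names. The number of links between two forms is one possible reading of “second-degree cognate”.

```python
from collections import deque

# Hypothetical cognate links between (variety, character) nodes.
EDGES = [
    (("Cantonese", "冇"), ("Cantonese", "無")),  # doublet within Cantonese
    (("Cantonese", "無"), ("Mandarin", "無")),   # cross-variety cognate
    (("Cantonese", "嚟"), ("Cantonese", "蒞")),  # doublet within Cantonese
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def cognate_distance(graph, start, goal):
    """Return the number of links on the shortest path, or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


graph = build_graph(EDGES)
print(cognate_distance(graph, ("Cantonese", "冇"), ("Mandarin", "無")))
```

Under these assumptions Cantonese 冇 reaches Mandarin 無 in two links, by way of Cantonese 無, while 嚟 and 無 stay unconnected because no link between them has been asserted.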

  6. Zrv says:

    I think the Chinese habit of using a written form — the Chinese character — as a short-hand way of representing cross-dialectal cognate sets has led to no end of trouble, and has really set back the progress of comparative dialectology. We’re probably stuck with it, but it’s a terrible idea. It distorts and oversimplifies complex historical phenomena, causes confusion because of partial overlap with standard orthographies for Mandarin, Classical Chinese, and other Chinese varieties (such as Cantonese), fails to distinguish between learned reading pronunciations of characters and etymological connections to characters, and places excessive emphasis on the medieval lexicographic tradition rather than the full scope of Chinese language history. It gives us the illusion that once we’ve identified a so-called “本字“ we have resolved all etymological issues associated with a word. And yet, as the discussion above has shown, it’s extremely hard for anyone to actually come to an agreement about what an “etymological” character representation really means.

    Imagine the problems that would arise if we tried to use citation forms of written Latin words to stand for Indo-European cognate sets. What would you then do with English words like “brother” and “fraternal”? Like “shirt” and “skirt”? Like “karaoke”?

    Also, doesn’t this idea imply that we need to start transcribing Mandarin using “之” substituted for “的”?

    This is a problem I’ve thought about a lot, and I don’t have a good solution. The habit of transcribing non-standard Chinese varieties in characters is deeply embedded in the practice of Chinese dialectology.

    • Kellen says:

      Excellent points as usual.

      Also, doesn’t this idea imply that we need to start transcribing Mandarin using “之” substituted for “的”?

      I’d say no, it doesn’t. 之 and 的 don’t have the same usage, and both exist in the same dialect. It’s safe to say they’re different words with a similar origin and usage, but like shirt and skirt, aren’t the same in a contemporary form of speech.

      白/文 distinctions are another issue, as you point out. For our purposes I guess I can just cross my fingers and hope it will be apparent which is being presented for a given utterance. With Wu that’s not too tough in a lot of cases, but I can see it getting out of hand, and fast.

      I think the Chinese habit of using a written form — the Chinese character — as a short-hand way of representing cross-dialectal cognate sets has led to no end of trouble

      I take your point, and for the most part I agree. On the other hand I think there is some small value in the practice. Certainly for things like machine translation, Google Translate tends to do much better with Chinese than Korean (with inputs containing a large number of Chinese loan words), and I can’t help but think that characters are part of that.

      • Zrv says:

        Kellen, thanks for your response. My comment about “之” and “的” was meant to be rhetorical: it shows the kinds of problems that can arise for any dialect if one seeks to transcribe always with so-called 本字. Every time there are doublets of the shirt-skirt type, ambiguity in transcription would arise. The over-generalized traditional wén-bái distinction is of course in reality only one manifestation of lexical layering that we find in different Chinese varieties; “之” and “的” don’t fit into that categorization.

        As for Google Translate, that’s a different issue entirely, unrelated to the transcription of spoken Chinese varieties. I didn’t mean for my criticism of the use of characters in linguistic comparison to be taken as a criticism of Chinese characters as an orthography (i.e. a functioning writing system for native speakers) for Standard Written Chinese (or written Cantonese, or written Shanghainese, or written Taiwanese, etc.). There’s tremendous value in written differentiation of homophonous morphemes. It’s one of the under-appreciated advantages of English spelling.

  7. gummyworm says:

    There isn’t an official set of characters (正字) for transcribing Cantonese. Instead, written Cantonese has been based on a de facto standard of vulgar characters (俗字). By the way, the 本字 of 佢 is more likely to be 其 rather than 伊.

    • Kellen says:

      Not sure anyone said there was an official way, but there’s definitely some serious consensus in the community.

      As for 佢/其/伊, I’m not sure you can say that. It’s still debated among historical linguists, for as much as they might actually care. Mostly I think it doesn’t matter much. Convention tends to be that 佢 is for the south and 伊 is for central dialects, though in some cases (like Wenzhou) 佢 may be used for a central dialect, and there I believe it may be a result of contact. That said, for our purposes at Phonemica, 其 wouldn’t be a good choice. 佢 or 伊 would be more useful for the computational side of things.

      Anyway if you’ve got a reliable source on that I’d love to see it.

