Character vivisection (衍)
I know you’ve all got character peeves. You must, because I do, and I’ve got all the visual aesthetic of the Generic Brand product line manager. The character 罚, for example, always seems strained to me, with the lower half looking like one of those typesetting snafus where a single word gets j u s t i f i e d across an entire line.
But this isn’t about peeves, especially 罚, because then someone might get excited about how it looks better in traditional characters (罰) and this would turn into a simplified vs. traditional free-for-all. We definitely don’t want that.
This is about tools for character analytics.
How does one go about finding another character like 衍 (or, alternatively, demonstrating that no other exists), one that has a 氵 three-drop water (三点水) in the middle of the character? More narrowly: how does one go about doing it without just asking people? How can you do it systematically?
The only reason the question came up, of course, was that the character just looks wrong to me. I mean: there are a gazillion characters with the three-drop water on the left, but I can’t remember seeing any that have it in the middle. Too bad that “can’t remember” doesn’t mean much for me, with my limited hanzi and faulty memory.
So maybe I can look it up, I thought. But the way I know how to use a dictionary doesn’t really work. You can look up characters that contain the three-drop water, but the dictionary then lists only characters where three-drop water is considered the key radical, which means that characters like 衍 itself are excluded because its radical is actually 行. Presumably the same could be true for other characters that have a three-drop water in the middle.
Eternal gratitude (maybe even fame, if we ever get the Sinoglot “Tools” page running) to the reader who can point me towards a more robust character decomposition tool.
UPDATE: From comments below, consider Wenlin or NJStar or this Wikimedia Commons project for character decomposition.
And here’s a great followup from Zev about 衍 itself:
Syz. I think your discomfort with it stems from the unusual structure of the character. It has two components: 行 and 水. 行 is one of the rare components that, as a radical, surrounds phonetic components rather than sitting beside (or above or below) them. (There are other radicals that surround components, like 囗, but these don’t split up when they do so.) This is why we have a whole slew of characters like 衝, 衛, 術, 街, etc.
It just so happened in the development of the standard script from earlier seal script that elements like 水 which were inserted in this way ended up in their abbreviated rather than their full form, presumably because they sit to the left side of another part of the character.
But that’s not the end of the story. If you compare 衍 to the other 行-radical words I’ve listed above, you’ll notice there’s still something odd about 衍. For the other characters, the middle component is a phonetic element — they are all typical 形聲字, or semantic-phonetic compounds. But 衍, pronounced yǎn not shuǐ, does not appear to be of this type. Indeed, the Shuōwén Jiězì classifies it as huìyì, a semantic compound character. Well then, you may ask, if 行 and 水 are playing equal semantic roles here, why didn’t the character get originally created as the more natural-looking 洐, with the water part on the left? The answer, perhaps, is that 洐 already existed; it writes xíng (行 is the phonetic).
I use Wenlin for that. It can show all the characters that include a given component. This way I found, for example, this baby: 匯, which should at least half count.
To save me a lot of words, here’s a picture of how one would do this: http://skitch.com/phyrex/n5qe4/wenlin
I also remember seeing a post on Sinosplice about electronic dictionaries. It discussed Wenlin (so that is how you should be able to find it) as well as the character (de-)composition feature of one such tool. That should do the trick as well.
I had thought the person behind zhongwen.com must have a breakdown like that…I seem to remember being able to look up a character based on any part of it, but now I can’t find where to do that.
Yeah, Wenlin is your best bet. There are also lists of character decompositions online that can help. Also, consider: 衔.
PS: Here you go:
嗨, 愆, 琺, 桫, 铴
There might be more, I didn’t look carefully
@Max: This is cool, so don’t count my next comment as ungrateful because it’s not about you… the problem with Wenlin is that I don’t have it. Not only that, but when I go to their website, money in hand, they tell me that they want to ship a CD to me. No download. Wenlin, are you listening???!!! People want to give you money but you won’t pull yourself out of 1990s technology so that you can accept it! Not a good business model.
@Karan: can you point to any of these online character decompositions? are they half as useful?
@Max, btw, from the Jun Da corpus, the most frequent of your sandianshui-in-the-middle characters is #3294. Then there’s 5208, 6255, and not found. Given that, I forgive myself for not having them immediately pop to mind 😀
Problems, problems, always problems! 😉
If you don’t want to get Wenlin by .. other means.. just order it from amazon – should be there in a day or two?
Also, here’s the excerpt from sinosplice that I was talking about:
—
NJStar Chinese Word Processor 4.35
http://www.njstar.com/
NJStar also has an Asian language viewer, but it’s been rendered pretty much completely unnecessary with internationalization advancements in Windows and other operating systems. The main draw is the word processor.
I’ve always found the dictionary that comes with the NJStar word processor to be virtually useless. NJStar’s saving grace is its radical lookup method. It consists of a chart containing all possible radicals (and even some that aren’t technically official). You click on the radicals within the character that you can identify. Here’s the good part: It doesn’t matter if they’re the character’s main radical or not. With each radical you identify, the list of possible matches at the top grows shorter until you can easily pick out the character. You can also limit matches by total number of strokes.
NJStar Chinese Word Processor’s radical lookup method is the best by far of any software I have seen. Everywhere else it’s lacking, however.
— [Source: http://www.sinosplice.com/life/archives/2004/01/31/wenlin-30 — I used the google cache version though]
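The narrowing behaviour described in that excerpt can be sketched in a few lines of Python. This is only an illustration of the idea, not NJStar's actual implementation: the `CHARS` table is hand-entered for this example, the name `narrow` is made up, and the stroke limit is interpreted here as an upper bound.

```python
# Hand-entered illustrative data (an assumption, not taken from NJStar):
# character -> (set of components, total stroke count).
CHARS = {
    "衍": ({"彳", "氵", "亍"}, 9),
    "街": ({"彳", "圭", "亍"}, 12),
    "河": ({"氵", "可"}, 8),
    "洐": ({"氵", "彳", "亍"}, 9),
}


def narrow(selected, max_strokes=None):
    """Return the characters that contain every selected component.

    It does not matter whether a component is the character's main
    radical. If max_strokes is given, characters with more strokes
    than that are dropped as well.
    """
    wanted = set(selected)
    return sorted(
        ch
        for ch, (parts, strokes) in CHARS.items()
        if wanted <= parts and (max_strokes is None or strokes <= max_strokes)
    )


# Each extra component shortens the candidate list.
print(narrow(["彳"]))
print(narrow(["彳", "氵"]))
print(narrow(["彳"], max_strokes=9))
```

Note that with this toy data 衍 and 洐 always come back together: a bare set of components cannot tell where in the character each component sits.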
I haven’t tried it (Mac user and all), but chances are that they’ve already entered the 21st century. Give it a try.
Two more rare characters that you would hate:
啵 bo (used in Middle Chinese and dialects to denote request, command, etc.; similar to modern Mandarin 吧)
磲 (砗磲/车渠) chēqú (1. giant clam; tridacna. 2. mother of pearl)
Yeah, Wenlin is fantastic for character decomposition.
I got NJStar running on my Mac, and it works as advertised: http://skitch.com/phyrex/n5qt4/njstar-radical-lookup-simplified-chinese
And, especially for you, I checked: Yes, they do allow credit card/paypal payment and downloading the software! 😉
I’m glad you brought up the character 衍 (yǎn, ‘overflow’), Syz. I think your discomfort with it stems from the unusual structure of the character. It has two components: 行 and 水. 行 is one of the rare components that, as a radical, surrounds phonetic components rather than sitting beside (or above or below) them. (There are other radicals that surround components, like 囗, but these don’t split up when they do so.) This is why we have a whole slew of characters like 衝, 衛, 術, 街, etc.
It just so happened in the development of the standard script from earlier seal script that elements like 水 which were inserted in this way ended up in their abbreviated rather than their full form, presumably because they sit to the left side of another part of the character.
But that’s not the end of the story. If you compare 衍 to the other 行-radical words I’ve listed above, you’ll notice there’s still something odd about 衍. For the other characters, the middle component is a phonetic element — they are all typical 形聲字, or semantic-phonetic compounds. But 衍, pronounced yǎn not shuǐ, does not appear to be of this type. Indeed, the Shuōwén Jiězì classifies it as huìyì, a semantic compound character. Well then, you may ask, if 行 and 水 are playing equal semantic roles here, why didn’t the character get originally created as the more natural-looking 洐, with the water part on the left? The answer, perhaps, is that 洐 already existed; it writes xíng (行 is the phonetic).
@Max: thanks for the research, maybe I’ll get it together and install…
@Alex: cool. “Hate” is too strong, of course, it’s just that I find the endless stream of characters makes me feel hopeless sometimes.
@Zev: now THAT is vivisection on a living, changing character. I put your comment up above in the post for those folks too lazy to read comments.
Zev’s got it.
Wenlin is a good tool for looking up characters by component — Pleco also has this functionality. The guy who works on Wenlin, Tom Bishop, is working on an XML-based language for character descriptions called the Character Description Language, and it strikes me that something like that — something that “knows” where character components are in relationship to other character components — might be capable of doing what you’re looking for. Unfortunately I think it hasn’t yet been implemented anywhere — the next version of Wenlin, maybe?
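To make the "knows where components are" idea concrete, here is a rough Python sketch. It does not use CDL; it uses Unicode Ideographic Description Sequences (IDS), where ⿰ means left-to-right and ⿲ means left-middle-right. The small `IDS` table is hand-entered for illustration only, and the function names are made up for this example.

```python
# Ideographic Description Characters (U+2FF0..U+2FFB) and how many
# operands each one takes.
IDC_ARITY = {
    "⿰": 2, "⿱": 2, "⿲": 3, "⿳": 3, "⿴": 2, "⿵": 2,
    "⿶": 2, "⿷": 2, "⿸": 2, "⿹": 2, "⿺": 2, "⿻": 2,
}

# Hand-entered, illustrative IDS table (an assumption, not from any tool).
IDS = {
    "衍": "⿲彳氵亍",
    "嗨": "⿰口海",
    "海": "⿰氵每",
    "河": "⿰氵可",
    "洐": "⿰氵行",
}


def parse(ids, pos=0):
    """Parse a prefix-notation IDS string; return (tree, next position)."""
    ch = ids[pos]
    pos += 1
    if ch not in IDC_ARITY:
        return ch, pos  # a plain component
    parts = []
    for _ in range(IDC_ARITY[ch]):
        part, pos = parse(ids, pos)
        parts.append(part)
    return (ch, *parts), pos


def flatten(tree):
    """List the components of a character from left to right.

    Only horizontal layouts (⿰ and ⿲) are opened up; anything else is
    kept as a single block. Components that have their own entry in
    the IDS table are expanded recursively.
    """
    if isinstance(tree, str):
        if tree in IDS:
            return flatten(parse(IDS[tree])[0])
        return [tree]
    op, *parts = tree
    if op in ("⿰", "⿲"):
        return [leaf for part in parts for leaf in flatten(part)]
    return [tree]


def has_in_middle(char, component):
    """True if the component sits strictly between the leftmost and rightmost parts."""
    row = flatten(char)
    return component in row[1:-1]


print([c for c in IDS if has_in_middle(c, "氵")])
```

With this table the search picks out 衍 and 嗨 but not 河 or 洐, which is exactly the distinction a plain "contains 氵" lookup cannot make. How well it works on real data depends entirely on how consistently the decompositions are recorded.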
There are component dictionaries available, though as I haven’t used them myself, I can’t attest to their usefulness.
That said, I’ve been looking for ages for character component databases that are in the public domain, if possible covering traditional, simplified and Japanese script. If anyone knows anything about this, I’d be very glad…
@A: funny you mention that. I have some vague recollection of a web-based tool that was pretty good with character parts. But that was years ago and I can’t find it now either. Maybe it was the same one.
Max just told me about a great resource in the public domain:
http://commons.wikimedia.org/wiki/Commons:Chinese_characters_decomposition
Wow, wow, wow!
@Chrix, thanks for sharing. Max saves the day! I love good stuff in the public domain. Even though it’s not user friendly, just a cursory glance seems to indicate it would take care of the lookup problem I was having with the sandianshui. The next question is whether some technically-minded person has put this into (or wants to put this into) some sort of databasey format at least for armchair pseudo-techies like me to use. Very cool.
The same people behind it have also made it into a Python library. Max was happily demonstrating to me how to look stuff up with it (they linked it with Unihan and other free databases, so you can search by pronunciation too). Have a look at the possibilities here: http://code.google.com/p/cjklib/wiki/Screenshots
The Wiki table is not very human-friendly, but it can be read by a computer without problem, so you can easily import it into Excel, or a real database programme and then process it further.
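For the "real database" route, a minimal Python sketch with the standard library's sqlite3 module might look like the following. The tab-separated layout used here (one character per line, followed by its parts) is a simplification assumed for this example and is not the actual column layout of the Commons table; `SAMPLE_TSV`, `load` and `containing` are hypothetical names.

```python
import csv
import io
import sqlite3

# Hypothetical export: one character per line, then its parts, tab-separated.
SAMPLE_TSV = "衍\t彳\t氵\t亍\n街\t彳\t圭\t亍\n河\t氵\t可\n洐\t氵\t行\n"


def load(tsv_text):
    """Load the decomposition lines into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE part (hanzi TEXT, position INTEGER, component TEXT)"
    )
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hanzi, parts = row[0], row[1:]
        conn.executemany(
            "INSERT INTO part VALUES (?, ?, ?)",
            [(hanzi, i, comp) for i, comp in enumerate(parts)],
        )
    return conn


def containing(conn, *components):
    """Return the characters that contain every one of the given components."""
    marks = ",".join("?" * len(components))
    sql = (
        f"SELECT hanzi FROM part WHERE component IN ({marks}) "
        "GROUP BY hanzi HAVING COUNT(DISTINCT component) = ? "
        "ORDER BY hanzi"
    )
    return [r[0] for r in conn.execute(sql, (*components, len(components)))]


conn = load(SAMPLE_TSV)
print(containing(conn, "氵"))
print(containing(conn, "彳", "亍"))
```

Storing the position of each part alongside the component leaves room for questions like "which characters have 氵 at position 1", though whether that corresponds to "the middle" depends on how the source data orders its parts.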
No, those are different people, and different composition data, but the cjklib is just AWESOME!
Have a look at the usage examples here: http://code.google.com/p/cjklib/wiki/Screenshots
The one that is relevant to your question is this:
—
$ cjknife -p 口土
吐呿咥哇垕㖏㙂㙅唑唗唟㖫哩啀啈啩㖶㗌䞤亴喔喹堡臵超㙜䞦䞧䞩喱嗑塣趌㗧䞫嘊嘡塾臺趗㙮䞳䞸嘢嘵墪趟㘁㙱㙳㙵㯧儓噇噻趦㘆㸀兣嬯懛擡薹䠟嚜嚡檯趫穯籉趮囈
—
Is that cool or what? 😀
Ah, I see, thanks for clarifying that, Max… So that means we’d have to contact two people if we wanted to use this in a citable manner, as to what sources the data is based on…
Perhaps not the case for this particular character, but examples of this sort could also be seen as 彳+ (氵+ 亍). This doesn’t work for 衍, because as far as I know “氵+ 亍” doesn’t exist (which proves conclusively that 氵 was really added to the middle), but it is certainly the case for a few of the others mentioned above, like 嗨.
John, I think for many characters, there are different ways of breaking them up into components.
On the talkpage of the Chinese Character Decomposition Project, they give an interesting example, where an etymologically based and a computationally based approach might be at loggerheads:
– For example, 雖 means “as big as a lizard”, and should be decomposited to “虫” and “唯”, where 虫 is the Radical, and 唯 is the pronounciation. Another examples, “臨” should be decomposited to “臥” and “品”. “發” to “弓” and “癹”.
– This might depend on the purpose of decomposition data. For instance, a computer program to generate stroke orders or graphical glyphs for Chinese characters would work properly for 雖 only using the graphical decomposition into 虽 and 隹. Perhaps it is necessary to distinguish, for some characters, between an etymological decomposition and a graphical one.
(from http://commons.wikimedia.org/wiki/Commons_talk:Chinese_characters_decomposition)
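One way to keep both views side by side, sketched in Python: store each decomposition under the purpose it serves. The entries below are only the ones quoted above; the dictionary layout and the name `parts_for` are assumptions made for this illustration, not how the Commons project stores its data.

```python
# Decompositions quoted in the talk-page excerpt, keyed by purpose.
# Only 雖 has both views given there, so the others list just one.
DECOMPOSITIONS = {
    "雖": {"graphical": ("虽", "隹"), "etymological": ("虫", "唯")},
    "臨": {"etymological": ("臥", "品")},
    "發": {"etymological": ("弓", "癹")},
}


def parts_for(char, purpose):
    """Return the parts recorded for this purpose, or None if that view is missing."""
    return DECOMPOSITIONS.get(char, {}).get(purpose)


print(parts_for("雖", "graphical"))
print(parts_for("雖", "etymological"))
print(parts_for("臨", "graphical"))
```

A glyph or stroke-order generator would ask for the graphical view, while a dictionary explaining the character's origin would ask for the etymological one.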