Does frequency matter?

Size certainly matters when it comes to making handwritten characters look pretty on the page. Take a look at this second-grade diary entry:


Horizontally and vertically complex characters (雕 and 像 as examples marked A and B) require more forethought to squeeze into their allotted space. Although she’s doing better than a year ago, clearly it’s a skill my daughter has yet to master (although she’s light-years ahead of me).

But what really jumped out at me was the Pinyin. Why did she fail to write these particular characters?

  1. huá 滑
  2. shī 湿
  3. shāo 烧
  4. pú 葡
  5. tao 萄
  6. lìng 另
  7. zhào 照

Don’t get me wrong. I’m all in favor of using Pinyin and not interrupting the writing process by looking up characters (parallel to the English problem of focusing on spelling over content). I’m just curious: does character frequency predict which characters she fails to write correctly?

Here are the frequency stats, using Jun Da’s character frequency database:

  • huá 滑 = 1480 (i.e. it’s the 1480th most frequent character)
  • shī 湿 = 1743
  • shāo 烧 = 1201
  • pú 葡 = 2130
  • tao 萄 = 2210
  • lìng 另 = 489
  • zhào 照 = 443

Superficially the answer seems to be no: frequency does not do a terribly good job of predicting character-writing failure (CWF*, i.e. the inability to remember how to write a character). The lowest-frequency characters, 葡萄 (pútao = grape), make sense; lìng 另 and zhào 照 do not. But there are some obvious issues:

  1. Not enough characters here to make a decent prediction.
  2. The corpus isn’t a very good fit. Jun Da says the Modern Chinese character frequency list is created from the types of texts…

Informative: Computer science, economics, education, government, health, history, law, military, news, philosophy, politics, popular science, religion, etc.
Imaginative: General fiction, children, detective, drama, history, Kongfu or martial arts, military, prose, literary review and science fiction, etc.

… that are not exactly representative of what a second grader has been encountering during the last five years.
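To make the mismatch easier to see, here is a minimal Python sketch that splits the missed characters into "common" and "rare" groups. The ranks are the seven quoted above from Jun Da's list; the cutoff of 1000 and the helper name `split_by_cutoff` are my own assumptions, picked purely for illustration and not taken from any standard.

```python
# Ranks as quoted above from Jun Da's character frequency list
# (1 = most frequent). Only these seven data points come from the post.
missed_ranks = {
    "滑": 1480,
    "湿": 1743,
    "烧": 1201,
    "葡": 2130,
    "萄": 2210,
    "另": 489,
    "照": 443,
}

# Hypothetical cutoff, chosen only to illustrate the split.
COMMON_CUTOFF = 1000


def split_by_cutoff(ranks, cutoff):
    """Return (common, rare) character lists, each ordered by rank."""
    common = sorted((c for c, r in ranks.items() if r <= cutoff), key=ranks.get)
    rare = sorted((c for c, r in ranks.items() if r > cutoff), key=ranks.get)
    return common, rare


common, rare = split_by_cutoff(missed_ranks, COMMON_CUTOFF)
print("missed despite being common:", common)
print("missed and relatively rare:", rare)
```

With so few data points this is only a way of eyeballing the list, not a test of anything. A better-matched corpus would simply mean swapping in different ranks.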

Still, this sample is enough to form a conjecture, which is the first half of the scientific method, right?

My conjecture is that raw frequency, even if using a well-matched corpus, will not do a great job of predicting CWF. Here are two other factors I think might be important:

  1. Frequency of components. Take the very common character 得, for example. Zev Handel made the point in comments the other day that the bottom-right component (寸 with an extra 一 on top) is unusual because it was historically misanalyzed: “the bottom right part of 得 is pretty weird — as far as I know it doesn’t occur in any other character”. This might be a test case to see if 得 experiences CWF more often than its frequency would suggest.
  2. Existence of semantically-related, phonetically-identical syllables that are graphically distinct. A good example might be the two huāng characters 荒 and 慌. They don’t mean the same thing, but they’re related. Moreover, their frequencies (1328 and 1650 respectively) are close. So I’d predict these would be discombobulated more often than their raw frequencies would suggest.

Anyone know of literature or other conjectures on this topic?

Oh, and btw: anyone want to point to other corpuses** and frequency analyses out there for Mandarin?


*CWF is really awful, I agree. Stigmatizing with the “failure” and so on. I’m completely open to better terms.

**No, not “corpora” dammit! One unusual plural per sentence is enough

PS: As we talked about this post, my daughter gave several big self-directed duhs at not having gotten lìng 另 and zhào 照. She shrugged her shoulders at pútao 葡萄 though, which she hasn’t learned to write yet. Clearly frequency has some relationship, even a strong relationship, to character writing ability. I’m just saying it’s not the only thing.

9 responses to “Does frequency matter?”

  1. pc says:

    While not completely related to your search for corpora, there is the Penn Chinese Treebank.

    One thing I was curious about was the development of handwriting.
    From what I’ve seen (i.e. I’m probably wrong), there are three stages of handwriting: print, semi-cursive, and cursive.

    Your daughter seems to have print handwriting, most notably in 不 and 还, which both have 4 distinct strokes for 不. However, in my 2nd yearish Chinese class many students, myself included, write in the semi-cursive manner (不, 好, 的 etc. have all become more or less one or two continuous strokes). Of course, my teacher writes in the occasionally illegible completely cursive script.

    Is there any particular time where the “real Chinese learner” (i.e. in China learning Chinese) starts to develop the quicker/looser characters?
    I can’t imagine it’s because she hasn’t learned them well enough or hasn’t had enough practice writing them. My guess is that it’s a teacher related thing, that is, “Well-formed characters show intelligent thought” or something similar.

    (p.s. There’s a missed kù above the A)

  2. Syz says:

    pc: Maybe Randy Alexander will eventually answer your “real Chinese learner” question in his blog about how native speakers learn the script. I’ll bet it’s very much curriculum-determined here in the PRC, not so much based on the whim of individual teachers.

    Strange I missed kù = 裤. I remember noticing it on the first read… Looking it up now, it’s #2090, so pretty low frequency for that corpus. But I’ll bet it’d be more common if we had a corpus for 2nd graders.

  3. NielDLR says:

    Ah this is a great post. I have just started studying corpus linguistics last week in my General Linguistics Uni course, and I’ve been looking for some Chinese corpora.

    But I totally agree, there is no relevant 2nd grade corpus, however I’m tempted to contact the guy over at Slow-Chinese to create a corpus out of his works, ’cause it has a very intermediate style. Might still not be relevant to a second grader, but definitely more relevant to foreign language learners of Chinese.

  4. Zev Handel says:

    Thanks for sharing this text with us! It’s fascinating.

    I would guess that characters whose phonetic component differs significantly from the character’s pronunciation, or whose phonetic component is associated with lots of different pronunciations, are harder to remember. This may be a factor that works together with frequency.

    In the examples above, I think the phonetics for huá 滑, shī 湿 and shāo 烧 are all pretty hard to remember because they don’t sound like the reading of the whole character, unlike, say, characters with 方 or 青 as phonetic.

  5. John says:

    I’m happy to see you recognizing that the corpus isn’t a good fit. I frequently talk to learners that see frequency data as some kind of truism, like the value of π or something. The good news is that spoken language corpora for Mandarin are coming! (Those should be a bit closer to the language a second grader is exposed to.)

  6. Chris says:


    1 Your handwriting will evolve depending on how much time you spend on it; taking calligraphy classes and reading up on different styles helps too. Be careful: it also regresses with less practice. Mine sucks these days.

    2 Frequency definitely has a relationship to the ability to write and recognize characters, stemming from the neurological traits of our brain: every impulse forms a pathway, and the more the same impulse is given, the better the pathway, hence better memorization. However, it comes down to creating these pathways. In my experience it helps to go about it in different ways. For example, tún 屯 I first remembered by connecting it to the bar street in Beijing, and so on and so on. That’s just to say that difficult characters have to be memorized in different ways; some even need full stories, for they are very obscure or rarely used outside of their context.

  7. Syz says:

    John, hey, don’t hold out: where are those spoken Mandarin corpuses going to be? (Or are they available already?)

    Zev, the phonetics! Kicking myself for not having thought of that. Some of them work really well and some just suck. I’ve been keeping a list in my Anki flashcard deck, actually. I could post it some day but maybe there’s a more methodical way. If we could find good data on miswritten characters, that would be a good start.

  8. John says:


    Sorry, not trying to be cryptic… I have the information on my other computer back in Shanghai, and I’m in Florida now.

  9. Syz says:

    John, cool. Enjoy the palm trees and we’ll talk corpuses when you’re back.
