So you want to count unique Chinese characters in a document…

On why you should check out Chad Redman’s corpus tools at zhtoolkit.

Does this chart qualify as corpus porn?

[Image: chart of character counts for the first part of 《兄弟》]

One of the benefits of rampant piracy is having access to digital texts for pretty much any popular novel in Chinese. Finding a text is a wee bit harder than a Google search (it helps to go to Baidu, and you have to dodge all the “we just want your email” scams), but so far I’ve been able to get the books I want without skanky computer viruses or an inbox full of ENlarGement ads. Most recently, with a little creative cutting and pasting, I’ve managed to get nearly a full copy (just missing the last few chapters) of 《兄弟》, Brothers, a novel I’m reading right now.

The purpose of digital copies isn’t to save money — for my actual reading I use a real book for which I paid genuine RMB at a legit-looking bookstore — but digital is awfully handy in looking up a phrase you remember, or exploring word usage, or whathaveyou.

It’s also perfect for adding characters to my Anki hanzi deck, complementary reinforcement for the same hanzi I’m already reading. I set up* an Excel spreadsheet that looks at a section of the text, pulls out all the characters that aren’t already in my deck, then creates cards that have the character on the front, and on the back show the Pinyin and a sentence or two of the context in which it appears in the book. Better than any translated “definition” of a character!
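For anyone who would rather script that step than wrestle with a spreadsheet, here is a minimal Python sketch of the same idea. It is not the spreadsheet's actual logic: the function names and the sample text are made up for illustration, it assumes that only the basic CJK Unified Ideographs block counts as hanzi, and it leaves out the Pinyin lookup, which would need a dictionary.

```python
import re


def is_hanzi(ch):
    # Assumption: only the basic CJK Unified Ideographs block (U+4E00..U+9FFF)
    # is treated as a character worth learning.
    return "\u4e00" <= ch <= "\u9fff"


def new_character_cards(text, known):
    """Map each hanzi not in `known` to the first sentence it appears in."""
    # A "sentence" here is a run of text ending in 。, ！ or ？ (if present).
    sentences = re.findall(r"[^。！？]+[。！？]?", text)
    cards = {}
    for sentence in sentences:
        for ch in sentence:
            if is_hanzi(ch) and ch not in known and ch not in cards:
                cards[ch] = sentence.strip()  # card front -> context for the back
    return cards


# Hypothetical deck and text, for illustration only.
deck = set("他了")
for front, back in new_character_cards("他笑了。她哭了！", deck).items():
    print(front, back)
```

Keeping only the first sentence per character is a design shortcut; collecting every sentence would give more context at the cost of longer cards.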

But then I started getting greedy. Because what I really want to do, of course, is analyze the text. I want to be able to do all sorts of wordcounts and frequencies and fun stuff — but without doing the hard work of learning Python or something. Oh, I know, I know — Python is sooo easy! But then, learning some simple piano tunes is easy too. So is juggling, really, or sub-one-minute Rubik’s cube solves, or… Yeah, that’s what I wondered too: why does stuff that’s so easy take so much time?

And easy is what brings us to the seduction of the chart above.

That table is showing the number of instances of each character in the first quarter or so (~50k characters) of 《兄弟》 / Brothers. The chart was generated automagically by this übercool tool at zhtoolkit. The only minor complaint is that it wouldn’t take the whole text — seemed to max out at around 60k characters.

Why am I so awed? Maybe I shouldn’t be, but for me it’s been surprisingly difficult to find any tool that would count unique characters, let alone an online tool that would do it for large quantities of text. For example, there’s a corpus toolset at Jun Da’s site (Jun Da hosts one of the few publicly available analyses of corpus word and character frequencies). But zhtoolkit allows orders of magnitude more text (I couldn’t get Jun Da’s to work except with trivial quantities: a few paragraphs) and it’s easier to use.

AND, potentially of most importance: the unique character count is only a subfeature of the tool’s main purpose. In fact, here’s a repeat of the instructions for getting unique character counts from the Chinese Forums post where I found it originally:

You just need to uncheck all the dictionaries and known word lists that are checked by default. If it doesn’t use its dictionary, it doesn’t know how to segment words, so it results in single characters.

In other words, what the tool wants to do is segment your text into real multisyllable words. How does it do this? I talked to Zhtoolkit creator Chad Redman, and he said:

The segmentation algorithm itself is just a simple longest match, but it works better than one would expect it to. The biggest gap in the tool is that it relies on the words being in CC-CEDICT, otherwise the words get segmented as single characters.
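To make "simple longest match" concrete, here is a minimal Python sketch of a greedy longest-match segmenter. It is an illustration of the general technique, not zhtoolkit's code: the function name, the `max_len` cap, and the three-word toy dictionary are assumptions standing in for a full CC-CEDICT word list.

```python
def segment(text, dictionary, max_len=4):
    """Greedy longest-match segmentation against a set of known words."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until the dictionary has it.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                words.append(candidate)
                i += size
                break
        else:
            # No multi-character word starts here: emit a single character.
            words.append(text[i])
            i += 1
    return words


# Toy dictionary, for illustration only.
toy_dictionary = {"李光头", "屁股", "兄弟"}
print(segment("李光头的屁股", toy_dictionary))  # ['李光头', '的', '屁股']
print(segment("李光头的屁股", set()))
```

With an empty dictionary every position falls through to the single-character branch, which is the same effect as unchecking all the dictionaries in the tool.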

Note that it also takes a stab at identifying 成语 (chéngyǔ = idioms) and even names. I haven’t played much with those features yet, so I’ll be curious to hear how others find it. In the meantime, here is a preview of some of the other things the tool does, results from using about 30k characters of the book, 《兄弟》:

Highlights places in the text where the tool identifies only one character, not multiple syllable words

In effect, this means that the yellow highlights are more likely to be mis-parsings. It’s not hard to find mis-parsings already, but it’s easy to see how even this first run would save a lot of work for someone who was trying to parse a whole text into individual words.

[Screenshot: zhtoolkit wordlist results]

Parses text into words, counts instances

Yes, it even has a sortable header row! The “Freq. per 1 million words” column gives the frequency of a given word in a sizable corpus (the Lancaster Corpus of Mandarin Chinese). The “No. occurrences” column is even more fun. Taking a look at the screenshot here, the most frequent words/characters all belong in characters’ names except for 屁股 (pìgu = butt). If you don’t know why, I won’t spoil it here. You don’t have to get more than a page or two into the novel to find out.

[Screenshot: zhtoolkit wordlist results]

Anyone know of better / complementary tools out there? I’ve got this one posted onto Sinoglot’s tools & resources page now and hope to find others to join it.

——–

*My own Excel spreadsheet is rather more manual than I make it sound here. If you really think it might be useful to you, I’m happy to give you a copy, just send me an email (syz) at (sinoglot) dot (com). But be warned that it’s well beyond User Unfriendly — it’s downright hostile. Really. Awful.

13 responses to “So you want to count unique Chinese characters in a document…”

  1. I have been looking for a proper corpus parser for Chinese for a while now and this is pretty much brilliant. I got so frustrated, that I was on the verge of trying to code my own, but this is just excellent. Thanks!

  2. Syz says:

    Glad to pass on the word! Like I said, I’m still surprised there aren’t more tools like this available out there.

  3. Jean says:

    Syz : I am using electronic texts too, in order to load them into Pleco and read during my commute. However, I am not impressed by the quality of the copy I found, there are a lot of typos. I guess it will not change the results of your analysis as you seem to focus on common characters and a few 他 instead of 她 or 的 instead of 得 won’t change the overall ranking. Still, be careful before doing any real work.

    Actually, I find it fun to be able to see if the writer used a pinyin-based method or some kind of character based method (五笔, 郑码, this kind of thing).

    To find a text, Google does work fine, and the filetype:txt filter can work wonders. My first try is on http://ishare.iask.sina.com.cn though, as it is simpler and cleaner. Text files will more often be on one piece, avoiding some painful copy and paste of texts meant to be read online and split on multiple pages to maximize ads exposure.

    On the geeky side, I guess no one wrote a parser like this because it is too easy. Your chart with single characters can be obtained with a single command line (on Linux or using Cygwin under Windows) :

    sed 's/\(.\)/\1\n/g' xiongdi.txt | sort | uniq -c | sort -rn

    It does not choke on long texts, so I can end the suspense here and give the winner for the whole text : 的 with 5011 occurrences ! The start of the list is as follows :

    5011 的
    3716 了
    3021 他
    2298 头
    2296 李
    2252 一
    1892 着
    1715 光
    1650 在
    1639 宋
    1605 个
    1557 是
    1433 们
    1430 来
    1379 上
    1276 地
    1260 说

    There are around 2367 characters in the whole text (I removed some punctuation and Arabic numerals, but I may have forgotten others). 364 of these are used only once.

    Of course, the word parsing part is a little bit harder and would need a small program. Still the greedy search approach seems easy …

  4. Syz says:

    Jean, thx for the load of info

    painful copy and paste of texts meant to be read online and split on multiple pages to maximize ads exposure

    uh, yeah, that was my 兄弟 experience.

    re errors: i haven’t read enough of the digital text to know, but makes sense if they’re typing them in by hand?! wtf? Haven’t the pirates heard about scanners and ocr? Anyway, good to remember the error possibilities for detailed work.

    re book searching: thanks for the tips. i’ll check this out and maybe replace my current incomplete copy of 兄弟

    re s/\(.\)/\1\n/g' xiongdi.txt | sort | uniq -c
    If that’s about my mother, you’d better take it back!
    But seriously: this is exactly my complaint about things that are “easy”. For a guy that hasn’t touched a command line since DOS (and then only probably to pirate some asinine video game), this might as well be Latvian written in Han’gul. Give me a window to copy and paste into! Still, it looks impressive, and thanks for the list 😉 good to see that 李光头 hasn’t lost his position in the top ranks.

    re: 2367 unique chars — sounds awfully low. Even lower if you skip the 364 one-hit wonders. This in itself would be an interesting book-by-book comparison. I think I once looked at 黄金时代 and that it had 3200 or something. Maybe I remember wrong though…

  5. Jean says:

    “re errors: i haven’t read enough of the digital text to know, but makes sense if they’re typing them in by hand?! wtf? Haven’t the pirates heard about scanners and ocr? Anyway, good to remember the error possibilities for detailed work.”

    Of course you are right, sorry for the stupid comment, I just didn’t think it through. The errors in these books must be OCR errors. It is just that I am learning Zhengma right now, and it makes me think differently about character input ;o). I saw a post with a scan of the 人民日报 where 温家宝 was spelt 温家室 (http://www.sinovision.net/blog/3346laorongshu/details/54417.html). I also read some online novels typed by wannabe writers and these ones have pinyin errors. Then I just mixed up everything ! (Actually, I recommend these online novels for someone wanting to read real stories but afraid of literature. OK, they are often quite cheesy, but they are simpler to follow and quite rewarding.)

    Concerning the low number of characters : It is a characteristic of Yu Hua works that has already been noted elsewhere : http://paper-republic.org/ericabrahamsen/yu-hua-fun-fact/ (even if they started with an exaggerated figure)

    Using my command line, I found 1909 unique characters for both 许三观卖血记 and 活着. 黄金时代 is also a pretty simple (and short) read, I found 1803 unique characters. On the other hand, a book like 丰乳肥臀 by 莫言 runs at a more impressive 3957 unique characters. That’s coherent with what I felt when trying to read it (well, before I gave up at page 5 …)

    Continuing the geeky aside : the command line is actually very easy if you know some Unix syntax. The | sign chains programs together, so we can break it down. Each program has a simple goal :
    * sed puts each character on a line of its own, because the other tools work line by line and not character by character (OK, for this one we need a little bit of regular expression, so it is harder)
    * sort sorts all the lines (we don’t really care about the order, we just want to group identical characters together)
    * uniq keeps one copy of each run of identical adjacent lines (which is why we sort first). With -c appended, it prefixes each line with the number of times it occurred
    * sort sorts the result a second time. The -n tells it to compare numerically, so 1000 comes after 999. The ‘r’ tells it to reverse the order so that the larger numbers are at the top

    I agree you can’t really expect everyone to write that, but it does reduce the incentive to write such a program because almost everyone able to write a program should know this kind of stuff. It means all the time would be spent on doing the interface, which is the boring part ;o)
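For readers who would rather not touch a shell, the pipeline Jean breaks down above can be approximated in a few lines of Python. This is a rough sketch, not an exact port: unlike the raw sed version it drops punctuation and digits up front by keeping only the basic CJK Unified Ideographs block (an assumption), and the function name and sample string are invented for illustration.

```python
from collections import Counter


def count_characters(text):
    """Count occurrences of each hanzi, like sed | sort | uniq -c."""
    # Assumption: only U+4E00..U+9FFF counts; punctuation, digits and
    # Latin letters are skipped.
    return Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")


# To run this on a real file, read it first, e.g.
#   text = open("xiongdi.txt", encoding="utf-8").read()
# (the encoding is an assumption; many Chinese .txt files use GB18030).
counts = count_characters("他的头，他的光头。")

# most_common() plays the role of the final `sort -rn`.
for ch, n in counts.most_common():
    print(n, ch)

print("unique characters:", len(counts))
print("total characters:", sum(counts.values()))
print("used only once:", [ch for ch, n in counts.items() if n == 1])
```

The last three lines give the same kind of summary figures quoted in this thread: unique characters, total characters, and the one-hit wonders.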

  6. Syz says:

    Jean, I’m really astounded by these low counts. I’ll have to see if I can find my old method and figure out what went wrong. Yeah, 3957 is a headful of hanzi.

    Thanks for the brief syntax lesson. I may decide, in the end, that this is one “easy” thing I really need to learn. Maybe first I’ll force myself to give up on one-minute cube solves.

  7. Chad says:

    Despite my fearful expectations, Google Image Search has exactly 2 results for “corpus porn”, and they are both from this blog.

    Syz, thanks for such a nice summary of this tool! It’s been a work in progress for a while, so I hadn’t publicized it much except for that one forum post. But as an avid maker of flashcard lists, I’ve used this a lot, and I’m glad to hear other people find it useful also. It is a little flaky at its current location–it’s bound to happen when sucking the whole of CC-CEDICT into memory on a shared hosting server :)

    I ran the sed command on Syz’s Xiong Di file. The counts were 2868 unique characters over 268,816 total characters. The number is surprising, I think, because people hear they need 2-3,000 characters just to read a newspaper, so they assume they would need around 5-10,000 for *real* fluency. Hey, Zhonghua Zihai has 85,000 characters, so learning 10,000 is getting off easy, right? But the 1 million plus characters of the Lancaster Corpus, which is comprised of many different types of texts, use fewer than 5,000 unique characters. For a book half the size of Xiong Di, I have seen a count of around 2700 characters, so 2868 may not be unreasonable.

    “Because what I really want to do, of course, is analyze the text. I want to be able to do all sorts of wordcounts and frequencies and fun stuff — but without doing the hard work of learning Python or something.”

    I know nothing about R (and it’s as unsearchable as you’d expect), but it’s fascinating how the introduction to Quantitative Corpus Linguistics with R addresses some of the same points you’ve mentioned: worrying about statistical discrepancies and getting bogged down with programming. I’ve been tempted to buy it for a while but I’ll save it for when I have some time to devote to learning it.

  8. Jean says:

    Oops, my mistake on the 兄弟 character count, the whole book was split in two parts and I used only the first one … I also find 2868 “real” characters for the whole book.

    This does not seem too low to me. I just think you need more than 2000 characters to read both a newspaper and a book. A newspaper will talk about a lot of topics, and each will have its specific vocabulary, which wouldn’t be very useful for a book. On the other hand a book like 黄金时代 does have characters that will not appear in a newspaper, all the familiar words and all the action verbs with the 扌 radical.

    I just tried a comparison : I took a few articles from 新京报 and compared them to 兄弟 : my articles contained 1274 unique characters, of which 76 were not in 兄弟. This includes rare characters for the names of the journalists, but also characters common for a newspaper like 厘,垄,域,届,屏,慈,捐,暑,核,疆,碍,筑,繁,趋,酝,酿,锐,… A larger number of articles would show hundreds of such characters.
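A comparison like the one Jean describes above comes down to a set difference. The Python sketch below shows the idea under stated assumptions: the function names and the two sample strings are hypothetical, and only the basic CJK Unified Ideographs block is treated as a character.

```python
def hanzi_set(text):
    """Return the set of distinct hanzi in a text."""
    # Assumption: only U+4E00..U+9FFF counts as a character.
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}


def missing_from(sample, reference):
    """Characters that occur in `sample` but never in `reference`."""
    return hanzi_set(sample) - hanzi_set(reference)


# Made-up snippets standing in for the newspaper articles and the novel.
articles = "核能的疆域"
novel = "他的兄弟"
print(len(hanzi_set(articles)), "unique characters in the articles")
print(sorted(missing_from(articles, novel)), "do not appear in the novel")
```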

  9. John Pasden says:

    Wow, this is an amazing resource! Thanks so much for calling attention to it.

    As for other corpus-related resources, have you seen this one? http://corpus.leeds.ac.uk/query-zh.html

  10. Syz says:

    @Jean: good to know that the counts agree, then. I fully agree with your point about newspapers. My qualitative experience is that newspapers are harder to read than novels, especially if you are just picking up a random article (might be less hard if you’re reading an article about a subject you’re familiar with). Maybe I’ll pick up on your ideas here and try to do another post quantifying that experience.

    @John: Thanks for the corpus interface page. I remember playing around with that once upon a time, but then I forgot about it. Should be useful. The thing I love about Chad’s tool is you can use it with whatever text you want — in effect create your own personal corpus of stuff you actually read. If we can just overcome that pesky character limitation…

    BTW, I am still waiting for someone to take up your call for a spoken Mandarin corpus. Keep hammering that theme!

  11. Interesting tool but every time I try to submit a query I get a 500 Internal Server Error – any idea why?

  12. Oh, wait, it seems I’m putting too much in the text box. Be nice if they let us know the character limit…

  13. Tamar says:

    Bookmarked. Thanks. This will make me even lazier than I am. 😀
