The notes themselves are fantastic, but what made me practically fall out of my chair was what he has in the appendix: a complete Manchu version of the Art of War in romanized text. And if that’s not enough, English glosses are given for each word/phrase! The romanization and glosses are provided by Hoong Teik Toh at Academia Sinica in Taiwan. Of all of the Manchu study materials that I’ve seen, this one has got to be the coolest!
And as Mark Swofford says in his announcement on Pinyin.info, this is most probably the longest piece of romanized Manchu text on the web. That makes it like a tiny little corpus (TLC™). So I started playing around with it, doing things that one might do with a corpus….
First of all, I have to confess that I’ve not kept up with something I should have: Natural Language Processing. I started learning Python and the Natural Language Toolkit (NLTK), and then stopped. I should really get back to learning it, because I’ll probably keep using it for the rest of my life. It would have been just the thing to use for playing around with this tiny little corpus. Anyway, I had to use what I’ve been using since before the NLTK came around.
Before I proceeded, I converted the romanization system to one that’s more computer-friendly. In the traditional P.G. Möllendorff romanization system, there are two letters with diacritics: š and û. Since these can’t be typed very easily, a modification has appeared on the web (I don’t know its origin) that substitutes every instance of š with x, and every instance of û with v. So I replaced the relevant letters with their more modern equivalents.
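For anyone who would rather script this substitution, here is a minimal Python sketch. It is not what I actually did. The function name `to_web_romanization` is made up for illustration, and handling uppercase letters and decomposed Unicode input is an assumption on my part, since I don't know how any particular copy of the text is encoded.

```python
import unicodedata

# Mapping described above: š -> x, û -> v.
# The uppercase pairs are an assumption, added in case the text has capitals.
DIACRITIC_TABLE = str.maketrans({"š": "x", "û": "v", "Š": "X", "Û": "V"})


def to_web_romanization(text):
    """Replace the two Möllendorff diacritic letters with their ASCII stand-ins."""
    # Normalize to NFC first so that "s" + combining caron is treated as "š".
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(DIACRITIC_TABLE)


if __name__ == "__main__":
    # Sample strings for illustration only.
    print(to_web_romanization("šun gûnin cooha"))
```

Words without either letter, such as cooha, pass through unchanged.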
I extracted the text and dumped it into MS Word, cleaned it up by removing anything that’s not a Manchu word, put all of the words into one column, dumped it into Excel, and then sorted and subtotaled to produce a word frequency list. I also did a character frequency list.
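The same sort-and-subtotal step can be approximated in a few lines of Python. This is only a sketch of the idea, not the Word and Excel process I used: it assumes the text has already been converted to the x/v romanization, and that a "word" is any run of the letters a to z. The function names are hypothetical, and the counts it gives may differ from mine depending on how the text was cleaned.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of a-z as words (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text):
    """Count how often each raw, unlemmatized word form occurs."""
    return Counter(tokenize(text))


def char_frequencies(text):
    """Count letters inside words only; spaces and punctuation are ignored."""
    return Counter("".join(tokenize(text)))


if __name__ == "__main__":
    # A made-up snippet for illustration, not a quotation from the text.
    sample = "cooha bata cooha, coohai"
    for word, count in word_frequencies(sample).most_common(100):
        print(word, count)
    print(sum(char_frequencies(sample).values()), "characters in total")
```

`Counter.most_common(100)` returns the hundred most frequent entries in descending order of count, which is the shape of a top-100 list.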
The character frequencies:
Total 32,139 Characters.
Looking at this, it also seems that the Chinese sound [ʣ] is represented 16 times, and that it is romanized as dz. Scanning through the text, it looks like it is used perhaps exclusively for the name Sun Zi.
[080922 – I noticed that in the word frequency list, “dz” only appears 14 times. There is another word that appears twice: dzu. I can’t find this in any dictionary, and the glosses that are given don’t seem to mention it. Does anybody have an idea what this might mean?]
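The gap between 16 and 14 comes from counting two different things: occurrences of the letter sequence dz anywhere in the text, versus occurrences of dz as a whole word. A small Python sketch makes the difference visible. The function name `digraph_report` and the sample line are hypothetical, and the tokenization (runs of a to z) is an assumption.

```python
import re
from collections import Counter


def digraph_report(text, digraph="dz"):
    """Return (total occurrences of the digraph, counts of words containing it)."""
    words = re.findall(r"[a-z]+", text.lower())
    # Every word form that contains the digraph, with its frequency.
    containing = Counter(w for w in words if digraph in w)
    # Total number of times the digraph appears across all words.
    total = sum(w.count(digraph) for w in words)
    return total, containing


if __name__ == "__main__":
    # A made-up line for illustration, not a quotation from the text.
    total, containing = digraph_report("sun dz hendume dzu sun dz")
    print(total, dict(containing))
```

Run over the full text, a report like this would list every word form that contributes to the character-level total, so forms such as dzu show up next to dz itself.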
The top 100 words (of course raw and unlemmatized):
Someday perhaps I’ll be inspired to add meanings, but for now I’ll just make an observation. In most word frequency lists that I’ve seen, the most common words are mostly function words. For a highly inflected language like Manchu, we can expect lexical words (or content words) to appear higher up in the list. In this list, five of the first twenty are lexical words: cooha (military), bata (enemy), afara (battle), aisi (benefit), and coohai (military (attrib. form)). Making a word frequency list from even a tiny little corpus such as this can give us some insight into what a text is all about.