It is with great fanfare that I announce Victor Mair’s new addition to Sino-Platonic Papers: an expanded set of notes (1.03 MB download, PDF) on his 2007 translation of the Art of War.
The notes themselves are fantastic, but what made me practically fall out of my chair was what he has in the appendix: a complete Manchu version of the Art of War in romanized text. And if that’s not enough, English glosses are given for each word/phrase! The romanization and glosses are provided by Hoong Teik Toh at Academia Sinica in Taiwan. Of all of the Manchu study materials that I’ve seen, this one has got to be the coolest!
And as Mark Swofford says in his announcement on Pinyin.info, this is most probably the longest piece of romanized Manchu text on the web. That makes it like a tiny little corpus (TLC™). So I started playing around with it, doing things that one might do with a corpus….
First of all, I have to confess that I’ve not kept up with something I should have: Natural Language Processing. I started learning Python and the Natural Language Tool Kit, and then stopped. I should really get back to learning it, because I’ll probably keep using it for the rest of my life. It would have been just the thing to use for playing around with this tiny little corpus. Anyway, I had to use what I’ve been using since before the NLTK came around.
Before I proceeded, I converted the romanization system to one that’s more computer-friendly. In the traditional P.G. Möllendorff romanization system, there are two letters with diacritics: š and ū. Since these can’t be typed very easily, a modification has appeared on the web (I don’t know its origin) that substitutes every instance of š with x, and every instance of ū with v. So I replaced the relevant letters with their more modern equivalents.
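For anyone who would rather script this substitution, here is a minimal sketch in Python. The function name `to_web_romanization` is made up for illustration, and the uppercase mappings and the circumflex variant û are assumptions added so that differently typed input is still caught.

```python
import unicodedata

# Letters with diacritics and their plain-ASCII web replacements.
# Assumption: uppercase forms and the circumflex spelling "û" are
# included only as a convenience; the post itself describes lowercase.
DIACRITIC_TO_ASCII = str.maketrans({
    "š": "x", "Š": "X",
    "ū": "v", "Ū": "V",
    "û": "v", "Û": "V",
})

def to_web_romanization(text: str) -> str:
    """Replace the diacritic letters with x and v."""
    # NFC first, so a letter typed as base + combining mark
    # (for example "s" followed by U+030C) becomes one character.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(DIACRITIC_TO_ASCII)

print(to_web_romanization("akū"))    # akv
print(to_web_romanization("gūnin"))  # gvnin
```

Normalizing before translating matters because text copied out of a PDF sometimes stores the diacritic as a separate combining character, which a plain search-and-replace would miss.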
I extracted the text and dumped it into MS Word, cleaned it up by removing anything that’s not a Manchu word, put all of the words into one column, dumped it into Excel, and then sorted and subtotaled to produce a word frequency list. I also did a character frequency list.
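The Word-and-Excel steps can also be scripted. The following is a minimal Python sketch, not the procedure actually used for the lists in this post. It assumes the text has already been converted to the x/v romanization and that the English glosses have been removed. The sample string is just a handful of words from the frequency list strung together, not a real sentence from the translation.

```python
import re
from collections import Counter

def word_frequencies(text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    # Treat every run of ASCII letters as one word; punctuation and
    # digits act as separators.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # Sort by descending count, then alphabetically within a count.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

sample = "cooha be baitalara doro. cooha i doro be sara urse."
for word, freq in word_frequencies(sample)[:3]:
    print(word, freq)
```

Sorting ties alphabetically mirrors what a sort-and-subtotal in a spreadsheet typically produces.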
The character frequencies:
Char | Freq | Char | Freq | Char | Freq | Char | Freq
a | 4182 | o | 1690 | h | 891 | v | 565
e | 3897 | g | 1645 | d | 880 | f | 520
i | 3393 | r | 1577 | s | 836 | y | 411
n | 2074 | m | 1261 | c | 804 | w | 183
b | 1975 | t | 1021 | l | 680 | x | 144
u | 1941 | k | 984 | j | 569 | z | 16
Total: 32,139 characters.
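A character count like this takes only a few lines in Python. The sketch below is an illustration, not how the table was actually made. The helper name `char_frequencies` is hypothetical, and the three input words are simply taken from the word list. The second half copies the counts from the table and checks that they add up to the stated total.

```python
from collections import Counter

def char_frequencies(words):
    """Count every letter across an iterable of romanized words."""
    counts = Counter()
    for word in words:
        counts.update(word)  # updating with a string counts its characters
    return counts

demo = char_frequencies(["cooha", "be", "akv"])
print(demo.most_common(2))

# Counts copied from the table in this post.
table = {
    "a": 4182, "e": 3897, "i": 3393, "n": 2074, "b": 1975, "u": 1941,
    "o": 1690, "g": 1645, "r": 1577, "m": 1261, "t": 1021, "k": 984,
    "h": 891, "d": 880, "s": 836, "c": 804, "l": 680, "j": 569,
    "v": 565, "f": 520, "y": 411, "w": 183, "x": 144, "z": 16,
}
print(sum(table.values()))  # 32139
```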
Looking at this, it also seems that the Chinese sound [ʣ] is represented 16 times, and that it is romanized as dz. Scanning through the text, it looks like it is used perhaps exclusively for the name Sun Zi.
[080922 – I noticed that in the word frequency list, “dz” only appears 14 times. There is another word that appears twice: dzu. I can’t find this in any dictionary, and the glosses that are given don’t seem to mention it. Does anybody have an idea what this might mean?]
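The two lists can be reconciled with a quick check. The sketch below uses only the two figures quoted above (14 for dz, 2 for dzu) and assumes that no other word in the text contains the letter z, which is consistent with the character count of 16. The function name `letter_occurrences` is made up for illustration.

```python
def letter_occurrences(word_freqs: dict[str, int], letter: str) -> int:
    """Total occurrences of a letter, given word -> frequency counts."""
    return sum(word.count(letter) * freq for word, freq in word_freqs.items())

# Frequencies quoted in this post for the only z-words noticed so far.
z_words = {"dz": 14, "dzu": 2}
print(letter_occurrences(z_words, "z"))  # 16
```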
The top 100 words (of course raw and unlemmatized):
Rank | Word | Freq | Rank | Word | Freq | Rank | Word | Freq | Rank | Word | Freq
1 | be | 457 | 26 | adali | 30 | 51 | horon | 18 | 76 | inenggi | 13
2 | de | 239 | 27 | geren | 30 | 52 | ojorongge | 18 | 77 | irgen | 13
3 | i | 189 | 28 | mangga | 30 | 53 | ejen | 17 | 78 | ningge | 13
4 | cooha | 135 | 29 | ume | 30 | 54 | sarkv | 17 | 79 | ubu | 13
5 | kai | 132 | 30 | ere | 29 | 55 | tuwa | 17 | 80 | uttu | 13
6 | oci | 101 | 31 | niyalma | 29 | 56 | waka | 17 | 81 | abkai | 12
7 | tuttu | 87 | 32 | sembi | 29 | 57 | emu | 16 | 82 | gaifi | 12
8 | ofi | 81 | 33 | doro | 28 | 58 | etembi | 16 | 83 | inu | 12
9 | ombi | 69 | 34 | etere | 28 | 59 | juwan | 16 | 84 | mergen | 12
10 | ba | 68 | 35 | gurun | 27 | 60 | komso | 16 | 85 | sain | 12
11 | bata | 67 | 36 | muterakv | 27 | 61 | musei | 16 | 86 | tesei | 12
12 | bi | 53 | 37 | jakanaburengge | 25 | 62 | neneme | 16 | 87 | wesihun | 12
13 | afara | 51 | 38 | na | 24 | 63 | alime | 15 | 88 | afaci | 11
14 | akv | 51 | 39 | sunja | 23 | 64 | hendume | 15 | 89 | arga | 11
15 | ojorakv | 48 | 40 | bade | 22 | 65 | muke | 15 | 90 | bucere | 11
16 | aisi | 46 | 41 | niyalmai | 22 | 66 | sara | 15 | 91 | etuhun | 11
17 | urunakv | 42 | 42 | seme | 22 | 67 | yaya | 15 | 92 | gvnin | 11
18 | serengge | 41 | 43 | ojoro | 21 | 68 | dz | 14 | 93 | ici | 11
19 | bime | 36 | 44 | ergi | 20 | 69 | mutembi | 14 | 94 | ohode | 11
20 | coohai | 36 | 45 | giyvn | 20 | 70 | sun | 14 | 95 | sejen | 11
21 | arbun | 35 | 46 | dahame | 19 | 71 | yooni | 14 | 96 | erebe | 10
22 | jiyanggiyvn | 34 | 47 | ilan | 19 | 72 | babe | 13 | 97 | eterengge | 10
23 | urse | 34 | 48 | muse | 19 | 73 | ceni | 13 | 98 | goloi | 10
24 | ci | 32 | 49 | terei | 19 | 74 | fiyelen | 13 | 99 | goro | 10
25 | baitalara | 31 | 50 | baita | 18 | 75 | hvsun | 13 | 100 | haksan | 10
Someday perhaps I’ll be inspired to add meanings, but for now I’ll just make an observation. In most word frequency lists that I’ve seen, most of the commonest words are function words. For a highly inflected language like Manchu, we can expect lexical words (or content words) to appear sooner. In this list, five of the first twenty are lexical words: cooha (military), bata (enemy), afara (battle), aisi (benefit), and coohai (military (attrib. form)). Making a word frequency list on a tiny little corpus such as this can give us some insight into what a text is all about.
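To put a number on that observation, the short Python sketch below takes the first twenty entries of the list and the five words identified above as lexical, then works out their share both by word type and by running tokens. The lexical/function split is the one given in this post, not the output of any tagger.

```python
# First twenty entries of the word frequency list in this post.
top20 = {
    "be": 457, "de": 239, "i": 189, "cooha": 135, "kai": 132,
    "oci": 101, "tuttu": 87, "ofi": 81, "ombi": 69, "ba": 68,
    "bata": 67, "bi": 53, "afara": 51, "akv": 51, "ojorakv": 48,
    "aisi": 46, "urunakv": 42, "serengge": 41, "bime": 36, "coohai": 36,
}
# The five words singled out above as lexical (content) words.
lexical = {"cooha", "bata", "afara", "aisi", "coohai"}

lexical_in_top20 = [word for word in top20 if word in lexical]
type_share = len(lexical_in_top20) / len(top20)

lexical_tokens = sum(top20[word] for word in lexical_in_top20)
total_tokens = sum(top20.values())
token_share = lexical_tokens / total_tokens

print(f"{type_share:.0%} of the top-20 word types are lexical")
print(f"{lexical_tokens} of {total_tokens} top-20 tokens ({token_share:.1%})")
```

By type the lexical words make up a quarter of the top twenty, but by token their share is smaller, because the particles at the very top of the list (be, de, i) are so much more frequent.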
si saivn?
sun dz i coohai doro bithe umesi sain! sinde ambula baniha!
bi sini asuba (website) be cihalambi!
acaki
bucin
Hi
I’m Jin, a student. I also asked a question on your Facebook page. According to this article, it seems that you support the romanization of the Manchu language?
But as far as I know, there is no standard for transcription.
May I ask how you do that?
Just very curious.
Thank you.
Jin
Möllendorff romanized Manchu and that has become the standard that most scholars use. On the internet, a slight modification of his system has come into use, where unused roman letters are used instead of letters with diacritics: x instead of š, v instead of ū. This is far more convenient to type (for those of us who normally type in languages that don’t use too many diacritics).