The Art of War — in Manchu!

It is with great fanfare that I announce Victor Mair’s new addition to Sino-Platonic Papers: an expanded set of notes (1.03 MB download, PDF) on his 2007 translation of the Art of War.

The notes themselves are fantastic, but what made me practically fall out of my chair was what he has in the appendix: a complete Manchu version of the Art of War in romanized text.  And if that’s not enough, English glosses are given for each word/phrase!  The romanization and glosses are provided by Hoong Teik Toh at Academia Sinica in Taiwan.  Of all of the Manchu study materials that I’ve seen, this one has got to be the coolest!

And as Mark Swofford says in his announcement on Pinyin.info, this is most probably the longest piece of romanized Manchu text on the web.  That makes it like a tiny little corpus (TLC™).  So I started playing around with it, doing things that one might do with a corpus….

First of all, I have to confess that I’ve not kept up with something I should have: Natural Language Processing.  I started learning Python and the Natural Language Toolkit (NLTK), and then stopped.  I should really get back to learning it, because I’ll probably keep using it for the rest of my life.  It would have been just the thing to use for playing around with this tiny little corpus.  Anyway, I had to use the tools I’ve been using since before the NLTK came around.

Before I proceeded, I converted the romanization system to one that’s more computer-friendly.  In the traditional P.G. Möllendorff romanization system, there are two letters with diacritics: š and ū.  Since these can’t be typed very easily, a modification has appeared on the web (I don’t know its origin) that substitutes every instance of š with x, and every instance of ū with v.  So I replaced the relevant letters with their more modern equivalents.
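For anyone who would rather script the substitution, here is a minimal Python sketch.  It is not what I actually did, just an illustration: the function name to_web_romanization is made up, and the table accepts both û and ū because the vowel can be found written either way.

```python
import unicodedata

# Hypothetical mapping from Möllendorff letters with diacritics to the
# plain ASCII letters of the web variant (š -> x, û/ū -> v).
_TABLE = str.maketrans({
    "š": "x", "Š": "X",
    "û": "v", "Û": "V",
    "ū": "v", "Ū": "V",
})

def to_web_romanization(text: str) -> str:
    """Replace š and û/ū with x and v, leaving everything else alone."""
    # NFC-normalize first so a base letter followed by a combining mark
    # (e.g. "s" + U+030C) becomes one precomposed character.
    return unicodedata.normalize("NFC", text).translate(_TABLE)

print(to_web_romanization("sarkū"))  # sarkv
print(to_web_romanization("gûnin"))  # gvnin
```

The normalization step matters mostly for text copied out of a PDF, where an accented letter is sometimes stored as two code points.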

I extracted the text and dumped it into MS Word, cleaned it up by removing anything that’s not a Manchu word, put all of the words into one column, dumped it into Excel, and then sorted and subtotaled to produce a word frequency list.  I also did a character frequency list.
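The same two lists could probably be produced without Word and Excel.  The sketch below is an assumption about how one might do it in Python, not a record of my actual steps: frequency_lists is a made-up name, the input is assumed to be already converted to the ASCII romanization (x and v), and the sample is just a toy string of words from the text, not a real sentence.

```python
import re
from collections import Counter

def frequency_lists(text: str):
    """Return (word_counts, char_counts) for an ASCII-romanized text."""
    # Every run of letters a-z counts as a word; punctuation, digits,
    # and whitespace are dropped, roughly like the manual cleanup.
    words = re.findall(r"[a-z]+", text.lower())
    word_counts = Counter(words)
    # Count characters only inside the extracted words.
    char_counts = Counter("".join(words))
    return word_counts, char_counts

# Toy input: a string of words, not a real Manchu sentence.
sample = "cooha be cooha de bata be"
word_counts, char_counts = frequency_lists(sample)

print(word_counts.most_common(2))  # [('cooha', 2), ('be', 2)]
print(sum(char_counts.values()))   # 20
```

Sorting by count is what most_common does; calling it with no argument returns the full list in descending order of frequency.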

The character frequencies:

Char  Freq    Char  Freq    Char  Freq    Char  Freq
a     4182    o     1690    h      891    v      565
e     3897    g     1645    d      880    f      520
i     3393    r     1577    s      836    y      411
n     2074    m     1261    c      804    w      183
b     1975    t     1021    l      680    x      144
u     1941    k      984    j      569    z       16

Total: 32,139 characters.

Looking at this, it also seems that the Chinese sound [ʣ] is represented 16 times (the 16 instances of z), and that it is romanized as dz.  Scanning through the text, it looks like it is perhaps exclusively used for the name Sun Zi.

[080922 – I noticed that in the word frequency list, “dz” only appears 14 times.  There is another word that appears twice: dzu.  I can’t find this in any dictionary, and the glosses that are given don’t seem to mention it.  Does anybody have an idea what this might mean?]
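The arithmetic in that note (14 + 2 = 16) is easy to check mechanically.  Here is a small Python sketch; words_with_letter is a made-up helper, and the three counts are simply copied from the lists in this post rather than recomputed from the text.

```python
from collections import Counter

def words_with_letter(word_counts, letter):
    """Return the sub-dictionary of words that contain `letter`."""
    return {w: n for w, n in word_counts.items() if letter in w}

# Counts taken from this post: dz (14), dzu (2), sun (14).
counts = Counter({"dz": 14, "dzu": 2, "sun": 14})

hits = words_with_letter(counts, "z")
print(hits)  # {'dz': 14, 'dzu': 2}

# Each token contributes one z per occurrence of the letter in the word.
total_z = sum(n * w.count("z") for w, n in hits.items())
print(total_z)  # 16
```

Run against the full word list, the same filter would show at a glance whether any word other than dz and dzu contains a z.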

The top 100 words (of course raw and unlemmatized):

Rank Word           Freq   Rank Word           Freq   Rank Word           Freq   Rank Word           Freq
1.   be              457   26.  adali            30   51.  horon            18   76.  inenggi          13
2.   de              239   27.  geren            30   52.  ojorongge        18   77.  irgen            13
3.   i               189   28.  mangga           30   53.  ejen             17   78.  ningge           13
4.   cooha           135   29.  ume              30   54.  sarkv            17   79.  ubu              13
5.   kai             132   30.  ere              29   55.  tuwa             17   80.  uttu             13
6.   oci             101   31.  niyalma          29   56.  waka             17   81.  abkai            12
7.   tuttu            87   32.  sembi            29   57.  emu              16   82.  gaifi            12
8.   ofi              81   33.  doro             28   58.  etembi           16   83.  inu              12
9.   ombi             69   34.  etere            28   59.  juwan            16   84.  mergen           12
10.  ba               68   35.  gurun            27   60.  komso            16   85.  sain             12
11.  bata             67   36.  muterakv         27   61.  musei            16   86.  tesei            12
12.  bi               53   37.  jakanaburengge   25   62.  neneme           16   87.  wesihun          12
13.  afara            51   38.  na               24   63.  alime            15   88.  afaci            11
14.  akv              51   39.  sunja            23   64.  hendume          15   89.  arga             11
15.  ojorakv          48   40.  bade             22   65.  muke             15   90.  bucere           11
16.  aisi             46   41.  niyalmai         22   66.  sara             15   91.  etuhun           11
17.  urunakv          42   42.  seme             22   67.  yaya             15   92.  gvnin            11
18.  serengge         41   43.  ojoro            21   68.  dz               14   93.  ici              11
19.  bime             36   44.  ergi             20   69.  mutembi          14   94.  ohode            11
20.  coohai           36   45.  giyvn            20   70.  sun              14   95.  sejen            11
21.  arbun            35   46.  dahame           19   71.  yooni            14   96.  erebe            10
22.  jiyanggiyvn      34   47.  ilan             19   72.  babe             13   97.  eterengge        10
23.  urse             34   48.  muse             19   73.  ceni             13   98.  goloi            10
24.  ci               32   49.  terei            19   74.  fiyelen          13   99.  goro             10
25.  baitalara        31   50.  baita            18   75.  hvsun            13   100. haksan           10

Someday perhaps I’ll be inspired to add meanings, but for now I’ll just make an observation.  In most word frequency lists that I’ve seen, most of the commonest words are function words.  For a highly inflected language like Manchu, we can expect lexical words (or content words) to appear sooner.  In this list, five of the first twenty are lexical words: cooha (military), bata (enemy), afara (battle), aisi (benefit), and coohai (military (attrib. form)).  Making a word frequency list on a tiny little corpus such as this can give us some insight into what a text is all about.

4 thoughts on “The Art of War — in Manchu!”

  1. si saivn?

    sun dz i coohai doro bithe umesi sain! sinde ambula baniha!
    bi sini asuba (website) be cihalambi!
    acaki
    bucin

  2. Hi
    I’m Jin, a student. I also consulted question on your Facebook page. According to this article, it seems that you support the romanization of Manchu language?
    But as far as I know, there is no standard for transcription.
    May I consult how do you do that?
    Just very curious.

    Thank you.

    Jin

    1. Möllendorff romanized Manchu and that has become the standard that most scholars use. On the internet, a slight modification of his system has come into use, where unused roman letters are used instead of letters with diacritics: x instead of š, v instead of ū. This is far more convenient to type (for those of us who normally type in languages that don’t use too many diacritics).
